Stable Diffusion

Stable Diffusion = Latent Diffusion Model

VAE
- Training
  - The encoder converts an image into a low-dimensional latent representation, which is then fed into the U-Net.
- Inference
  - The decoder converts the latent representation back into an image.
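
To make "low-dimensional" concrete, here is a minimal sketch of the shape arithmetic. The downsampling factor of 8 and the 4 latent channels are the values commonly cited for Stable Diffusion v1 and are assumptions here, not something stated in these notes; `latent_shape` and `compression_ratio` are hypothetical helper names.

```python
def latent_shape(height, width, downsample_factor=8, latent_channels=4):
    """Return (channels, height, width) of the latent for an RGB image.

    downsample_factor=8 and latent_channels=4 are assumed defaults
    (the values commonly cited for Stable Diffusion v1).
    """
    if height % downsample_factor or width % downsample_factor:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (latent_channels, height // downsample_factor, width // downsample_factor)


def compression_ratio(height, width, image_channels=3, **kwargs):
    """How many times fewer values the latent holds than the pixel image."""
    c, h, w = latent_shape(height, width, **kwargs)
    return (image_channels * height * width) / (c * h * w)


shape = latent_shape(512, 512)
ratio = compression_ratio(512, 512)
print(shape, ratio)  # (4, 64, 64) 48.0
```

Under these assumptions the U-Net works on 48 times fewer values than it would in pixel space, which is the main reason diffusion in latent space is cheaper.
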
U-Net
- Encoder
  - Built from ResNet blocks; downsamples the representation to a lower resolution.
- Decoder
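  - Built from ResNet blocks; upsamples the low-resolution representation back to a higher resolution.
  - The U-Net output predicts the noise residual (that is, it estimates how much noise has been added to the original image, more precisely to its latent, at the current timestep).
  - Cross-attention layers allow the output to be conditioned on the text embeddings.
    - The cross-attention layers are inserted between the ResNet blocks in both the encoder and the decoder.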
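
The idea of predicting a noise residual can be sketched as follows. This assumes the standard DDPM-style forward process, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, which these notes do not spell out; the function names and the value of `alpha_bar_t` are illustrative. A flat list of floats stands in for a latent, and the true noise stands in for a perfect U-Net prediction.

```python
import math
import random


def add_noise(x0, eps, alpha_bar_t):
    """Forward process (assumed DDPM form): mix the clean latent with noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Invert the forward process using a predicted noise residual."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(x_t, eps_pred)]


def mse(u, v):
    """Mean squared error, the usual training loss on the noise residual."""
    return sum((p - q) ** 2 for p, q in zip(u, v)) / len(u)


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # stand-in for a clean latent
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]     # noise that was actually added
alpha_bar_t = 0.5                                  # illustrative schedule value

x_t = add_noise(x0, eps, alpha_bar_t)

# A perfect noise prediction has zero loss and recovers the clean latent.
loss = mse(eps, eps)
x0_rec = predict_x0(x_t, eps, alpha_bar_t)
max_err = max(abs(p - q) for p, q in zip(x0, x0_rec))
print(loss, max_err < 1e-9)
```

In training, `eps_pred` would come from the U-Net given `x_t`, the timestep, and the text embeddings, and `mse(eps_pred, eps)` would be minimized.
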
Text Encoder
- Converts the text input into embeddings that the U-Net can understand.
- A transformer-based encoder that maps a sequence of input tokens to a sequence of latent text embeddings.
- Instead of learning the embeddings from scratch, Stable Diffusion reuses CLIP's pretrained text encoder.
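
How the text embeddings reach the U-Net can be illustrated with a toy single-head cross-attention in plain Python: queries come from the latent features, keys and values come from the text embeddings. All sizes and weight matrices below are made up for illustration; the real model uses learned projections, multiple heads, and much larger dimensions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_feats, text_emb, w_q, w_k, w_v):
    """Queries from latent features; keys and values from text embeddings."""
    q = matmul(latent_feats, w_q)      # (n_positions, d)
    k = matmul(text_emb, w_k)          # (n_tokens, d)
    v = matmul(text_emb, w_v)          # (n_tokens, d)
    d = len(q[0])
    scores = matmul(q, transpose(k))   # (n_positions, n_tokens)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy inputs: 3 latent positions with 2 features, 4 text tokens with 3 features.
latent_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]                 # hypothetical projection weights
w_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_v = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

out, weights = cross_attention(latent_feats, text_emb, w_q, w_k, w_v)
print(len(out), len(out[0]))  # 3 2
```

Each latent position ends up with a weighted mix of the token values, so the output keeps the spatial layout of the latent while carrying information from the prompt.
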
