Diffusion models: Everything you need to know

Diffusion models were a game changer in 2022. Here's every vision application they transformed: image, text, video, 3D, and more!

This week's iteration focuses on Diffusion models, which were a game changer in 2022 across vision applications: image, text, video, 3D, and more! We cover what they are and how they were used in all those amazing applications, with lots of videos and featured articles. We will introduce most approaches and give a short explanation, but we invite you to watch or read the featured content for a complete understanding. We hope you enjoy it.

1️⃣ To start: What are Diffusion models?

Nothing's better for explaining something than a concrete example. Let's go over Diffusion models using one of the most popular models of the past few months: Stable Diffusion. Stable Diffusion is a powerful text-to-image model based on a recent technique called Latent Diffusion. Latent Diffusion makes the diffusion process more efficient by running it in a compressed latent space instead of on full-resolution pixels, but the process itself stays exactly the same, so it will do fine for this explanation. That efficiency also makes it much more accessible to "non-Google" entities, like us.

Diffusion models are iterative models that take random noise as input. This noise can be conditioned with text, an image, or other modalities (types of inputs), so it is not completely random. The model iteratively learns to remove this noise, learning what parameters it should apply at each step to end up with a final image. So a basic diffusion model takes random noise the size of the image and learns to remove it, step by step, until a real-looking image remains.

This is possible because the model has access to real images during training and can learn the right parameters by iteratively adding noise to each image until it becomes complete, unrecognizable noise. Once the noised versions of all images look similar, meaning they come from the same distribution, the model is ready to be used in reverse: we feed it fresh noise of that same distribution and run the steps backwards, expecting an image similar to the ones used during training.
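The forward (noising) half of this training process can be sketched in a few lines. This is a toy illustration of a standard DDPM-style linear noise schedule on a random array, not Stable Diffusion's actual implementation; the schedule values are common defaults assumed for the example.

```python
import numpy as np

# Toy sketch of the forward (noising) process described above, assuming a
# standard DDPM-style linear noise schedule. "x0" is just a random array
# standing in for a training image.

rng = np.random.default_rng(0)

T = 1000                              # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)    # how much noise is added at each step
alpha_bars = np.cumprod(1.0 - betas)  # fraction of original signal left at step t

def add_noise(x0, t):
    """Jump straight from the clean image x0 to its noisy version at step t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps                   # during training, the model learns to predict eps

x0 = rng.standard_normal((8, 8))      # toy "image"
x_t, eps = add_noise(x0, T - 1)

# By the last step almost no signal remains: x_t is essentially pure noise,
# which is exactly the "same distribution" mentioned above.
print("signal left at final step:", float(np.sqrt(alpha_bars[T - 1])))
```

Generation then runs this in reverse: starting from pure noise, the trained network predicts `eps` at each step and subtracts a little of it out at a time.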

Learn more in my article about Stable Diffusion or in the video below!
Watch the video
Now that we've introduced what they are, let's dive into "why they are".
Diffusion models have been a game changer for many applications, especially in computer vision. Let's have a look at how exactly they disrupted our industry, how widely they were applied in research, and how they managed to improve upon state-of-the-art (SOTA) results.
p.s. don't forget to follow me on Twitter to stay up-to-date with AI!

2️⃣ From Image Generation to Image Manipulation

Text-to-Image models like DALLE or Stable Diffusion are really cool and allow us to generate fantastic pictures with a simple text input. But wouldn't it be even cooler to give them a picture of you and ask them to turn it into a painting? Imagine being able to send any picture of an object, a person, or even your cat, and ask the model to transform it into another style, like turning yourself into a cyborg, rendering yourself in your preferred artistic style, or adding yourself to a new scene.

Basically, how cool would it be to have a version of DALLE we can use to photoshop our own pictures instead of getting random generations? It would give us a personalized DALLE while making it much simpler to control the generation, since "an image is worth a thousand words". It would be like having a DALLE model as personalized and addictive as the TikTok algorithm.

Well, this is what researchers from Tel Aviv University and NVIDIA worked on. They developed an approach for conditioning text-to-image models, like the Stable Diffusion model we just reviewed, with a few images representing any object or concept, through the words you send along with your images. It then transforms the object of your input images into whatever you want! Learn more in the article or the video below!
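The core trick of this approach (known as textual inversion) is surprisingly small: the entire text-to-image model stays frozen, and only one new word embedding is optimized so that the frozen model reconstructs your handful of images. Here's a heavily simplified, runnable toy of that structure; the "frozen model" is just a fixed linear map rather than a real denoiser, and all dimensions and learning rates are made up for illustration.

```python
import numpy as np

# Toy illustration of textual inversion's structure: freeze everything,
# optimize only one new embedding vector. The frozen "model" here is a
# fixed linear map, NOT a real diffusion U-Net; dims and lr are arbitrary.

rng = np.random.default_rng(1)
dim = 16
W = rng.standard_normal((dim, dim)) * 0.1   # frozen model weights (never updated)
target = rng.standard_normal(dim)           # stand-in for "reconstruct my images well"

v = np.zeros(dim)                           # the single new word embedding, e.g. "S*"
lr = 0.5
initial_loss = 0.5 * np.sum((W @ v - target) ** 2)

for step in range(2000):
    residual = W @ v - target               # how badly the frozen model does, given v
    v -= lr * (W.T @ residual)              # gradient step on the embedding ONLY

final_loss = 0.5 * np.sum((W @ v - target) ** 2)
print("loss went from", round(initial_loss, 3), "to", round(final_loss, 3))
```

In the real method, the reconstruction target is the diffusion denoising loss over your 3-5 concept images, and the learned `v` becomes a new token you can drop into any prompt.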

Watch the video

3️⃣ From Images to Videos!

Meta AI published a model for video generation called Make-A-Video. In a single sentence: Make-A-Video generates videos from text. It’s not only able to generate videos, but it’s also the new state-of-the-art method, producing higher quality and more coherent videos than ever before!

You can see this model as a Stable Diffusion model for videos, surely the next step after being able to generate images. This is all information you must’ve seen already on a news website or just by reading the title, but what you don’t know yet is what it is exactly and how they adapted such a model to videos by adding a time dimension to images. Learn more in the article or the video below!

Watch the video

4️⃣ From Videos to 3D!

We’ve seen models able to take a sentence and generate images. Then came other approaches that manipulate the generated images by learning specific concepts like an object or a particular style.

Fast forward a few more weeks and Meta published the Make-A-Video model we just learned about, which allows you to generate a short video, also from a text sentence. The results aren’t perfect yet, but the progress we’ve made in the field since last year is just incredible.

Now, we make another step forward.

Here’s DreamFusion, a new Google Research model that understands a sentence well enough to generate a 3D model of it. You can see it as a DALLE or Stable Diffusion but in 3D.

It basically uses two models I already covered: NeRFs and one of the text-to-image models. In their case, it is the Imagen model, but any would do, like Stable Diffusion. Learn more about DreamFusion in the article or in the video below!
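The glue between the two models is a loss called Score Distillation Sampling (SDS): render the NeRF, add noise to the rendering, ask the frozen text-to-image model what noise it sees, and nudge the NeRF's parameters so the rendering becomes something the image model believes in. Below is a deliberately tiny, runnable caricature of that loop; the "renderer" is an identity map and the "frozen model" is a hand-built denoiser that believes the scene should be all ones, both assumptions made purely for illustration.

```python
import numpy as np

# Tiny caricature of DreamFusion's SDS loop. The "NeRF" is 4 numbers, the
# "renderer" is the identity, and the frozen "text-to-image model" is a
# hand-made denoiser that believes the true image is all ones. All toy.

rng = np.random.default_rng(2)

target = np.ones(4)                 # what the frozen image model "believes in"
theta = rng.standard_normal(4)      # parameters of the toy 3D scene
alpha_bar = 0.5                     # one fixed noise level, for simplicity

def render(theta):
    return theta                    # stand-in differentiable renderer

def predict_noise(x_t):
    # A perfect denoiser that assumes the clean image is `target`.
    return (x_t - np.sqrt(alpha_bar) * target) / np.sqrt(1.0 - alpha_bar)

for step in range(300):
    x = render(theta)
    eps = rng.standard_normal(4)
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    # SDS update: (predicted noise - true noise), pushed back through the renderer.
    theta -= 0.05 * (predict_noise(x_t) - eps)
```

The real method swaps in an actual NeRF for `render` and Imagen for `predict_noise`, but the update has the same shape: no 3D training data is ever needed, only the frozen model's opinion of noisy renders.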

Watch the video

5️⃣ What does Lauren, our AI Ethics expert, have to say about diffusion models? (specifically about DALL•E here)

Despite having some choice words for OpenAI regarding their bias problems with CLIP, I think they’ve done a great job at mitigating bias at many points early on with DALL•E 2, and it seems to be working well. I especially appreciate the reweighting of the images when the team realized there was an imbalance reflected in the results. It often appears like filtering negative data is the ultimate bias mitigator, but OpenAI has proved you can (and should) go multiple steps further to reduce harm.

My greatest hope for DALL•E 2 is that its bias mitigation treatment sets a precedent for the level of care that should be shown in many other models, even when they’re not massive, widely used meme sources. Do the right thing even when no one is watching!

- AI Ethics segment by Lauren Keegan
Want to get into AI or improve your skills? Click here!
We are already at the end of this AI weekly digest! Thank you for thoroughly going through this iteration! I hope you enjoyed it and learned about Diffusion models. Stay tuned for more news and research related to computer vision, and most certainly diffusion models, with the CVPR 2023 deadline next week. Tons of new research is going to be released, and I will be sure to cover it!