Mark Henry
Li et al. (2022) present Diffusion-LM, a diffusion text model. Unlike autoregressive models, which generate text one token at a time, diffusion models start with a sequence of random noise and iteratively refine, or "denoise," it into desirable tokens. (Google has a great explanation and demo on the Gemini Diffusion webpage.)
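To make that concrete, here is a minimal sketch of the reverse (sampling) loop, assuming a trained denoiser `model(x, t)` that predicts the clean latents from a noisy sequence; the names and the ᾱ_t = 1 − t/T schedule are illustrative, not my exact implementation:

```python
import torch

@torch.no_grad()
def sample(model, seq_len, emb_dim, T=2000):
    # Start from pure Gaussian noise in embedding space.
    x = torch.randn(1, seq_len, emb_dim)
    for t in reversed(range(1, T + 1)):
        x0_hat = model(x, t)  # predict the clean latents from noisy x_t
        if t > 1:
            # Re-noise the prediction down to the t-1 noise level
            # (assuming alpha_bar_t = 1 - t/T; see the table below).
            s = (t - 1) / T
            x = (1 - s) ** 0.5 * x0_hat + s ** 0.5 * torch.randn_like(x)
        else:
            x = x0_hat
    return x  # final latents, rounded to token embeddings afterward
```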
Although diffusion models still lag behind autoregressive models in cogency and utility, reproducing the model taught me how a diffusion text model works. My BERT-based diffusion model was trained on WikiText on my own RTX 5070.
My model shows desirable qualities: good denoising fidelity, graceful degradation as noise increases, and generation of recognizable tokens. The table below summarizes denoising fidelity across noise levels:
| Timestep | Noise % | Cosine similarity | Std. dev. | Quality |
|---|---|---|---|---|
| t=0 | 0.0% | 0.9983 | ±0.0002 | 🟢 Excellent |
| t=1 | 2.4% | 0.9982 | ±0.0001 | 🟢 Excellent |
| t=5 | 5.1% | 0.9984 | ±0.0001 | 🟢 Excellent |
| t=10 | 7.1% | 0.9981 | ±0.0001 | 🟢 Excellent |
| t=50 | 15.8% | 0.9967 | ±0.0003 | 🟢 Excellent |
| t=100 | 22.4% | 0.9956 | ±0.0005 | 🟢 Excellent |
| t=500 | 50.0% | 0.9888 | ±0.0016 | 🟢 Excellent |
| t=1000 | 70.7% | 0.9769 | ±0.0037 | 🟢 Excellent |
| t=1500 | 86.6% | 0.9486 | ±0.0097 | 🟢 Excellent |
| t=1900 | 97.5% | 0.7388 | ±0.0319 | 🟡 Good |
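One detail worth noting: the noise percentages above closely track √(t/T) with T = 2000, which corresponds to a linear ᾱ schedule, ᾱ_t = 1 − t/T. This is my inference from the numbers, not something the table states:

```python
import math

T = 2000  # total diffusion timesteps (inferred from the table)

def noise_fraction(t: int) -> float:
    # With alpha_bar_t = 1 - t/T, the noise coefficient
    # sqrt(1 - alpha_bar_t) reduces to sqrt(t / T).
    return math.sqrt(t / T)

for t in [1, 10, 100, 500, 1000, 1500, 1900]:
    # e.g. t=1000 -> 70.7%; the lowest timesteps deviate slightly,
    # suggesting a small offset in the actual schedule.
    print(f"t={t}: {noise_fraction(t):.1%}")
```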
Final generated text: 'trump visits session tvo ability la when apparently worse defend construction pond having geographic wheelhood swedish magazine explosion within girl ab forget cruz developers uniqueulation associated ashley mcourown 1920 interviewged anti septfula marcus supports utility springfield destructiveoga, change, university unique cell system twitter thinkingons en attack option hopefully children 40 consumers entrance'
The central idea of Diffusion-LM is its loss function, which has three terms. The loss is applied end to end, across both the embedding layer and the transformer, during training.
Two of the loss components control the "spread" of the embeddings, and one rewards correct denoising.
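Here is a rough sketch of how those three terms fit together. This is my paraphrase of the Li et al. objective, not my exact training code: `q_sample` and `T` are assumed attributes of the model, and the equal term weighting is illustrative.

```python
import torch
import torch.nn.functional as F

def diffusion_lm_loss(model, embeddings, input_ids):
    x0 = embeddings(input_ids)  # ground-truth latents EMB(w)
    t = torch.randint(1, model.T + 1, (x0.size(0),), device=x0.device)
    x_t = model.q_sample(x0, t, torch.randn_like(x0))  # forward-noise to level t

    # 1) Denoising term: reconstruct the clean latents from the noisy ones.
    l_denoise = F.mse_loss(model(x_t, t), x0)

    # 2) Embedding term: the prediction at t=1 must land on EMB(w) itself,
    #    which keeps the learned embeddings from spreading arbitrarily.
    t1 = torch.ones_like(t)
    x_1 = model.q_sample(x0, t1, torch.randn_like(x0))
    l_embed = F.mse_loss(model(x_1, t1), x0)

    # 3) Rounding term: cross-entropy of decoding the true tokens from x0,
    #    tying each embedding tightly to its discrete token.
    logits = x0 @ embeddings.weight.T  # (batch, seq, vocab)
    l_round = F.cross_entropy(logits.flatten(0, 1), input_ids.flatten())

    return l_denoise + l_embed + l_round
```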
The final important concept is that latents in continuous space are rounded off to the nearest token embedding in the final step of denoising. This bridges the gap between the continuous and discrete domains.
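A minimal sketch of that rounding step, assuming access to the embedding matrix (nearest neighbor by L2 distance here; rounding can also go through a learned softmax over the vocabulary):

```python
import torch

def round_to_tokens(latents: torch.Tensor, embedding_weight: torch.Tensor) -> torch.Tensor:
    # latents: (batch, seq_len, dim); embedding_weight: (vocab, dim)
    # Distance from every latent to every token embedding, then argmin.
    vocab = embedding_weight.unsqueeze(0).expand(latents.size(0), -1, -1)
    dists = torch.cdist(latents, vocab)
    return dists.argmin(dim=-1)  # (batch, seq_len) token ids
```

The argmin over distances is what turns the sampler's continuous output into the token ids that get detokenized into text like the sample above.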
My code is available at https://github.com/mark-henry/text-diffusion.