GLIDE's recipe for filtering harmful images before training an image generation model
The GLIDE paper by OpenAI was probably the first instance of text-guided image generation that blew me away. In Appendix F1, they detail a recipe for filtering harmful images out of the training data. This process was later recycled for training DALL-E 2.
Filtering human images from the dataset
To prevent misuse and liability problems, they wanted to train a model that was incapable of generating images of any human, which meant removing all images of humans from the training data.
They accomplished this by training a task-specific human vs. not-human binary classifier, then using that classifier to detect the images they wanted to throw out. They prioritized recall over precision here: they really wanted to make sure they filtered out all the human images, even if that meant a few false positives.
The recipe:
- collect several thousand boolean human vs not-human labels for random samples
- resize smallest edge to 224 pixels
- take three square crops: one at each end and one in the middle along the longer side of the image
- extract an embedding vector for each crop using a pre-trained CLIP ViT-B/16, then mean-pool the three embedding vectors into a single vector
- They don't explain why they use mean-pooling. One paper, "Mean Embeddings with Test-Time Data Augmentation for Ensembling of Representations" argues that mean-pooling a bunch of augmented versions of the same image is better since models are not truly 100% invariant to augmentations. I don't get why they didn't use rotations or flips instead, since crops could completely remove the human if the image is zoomed out.
- fit an SVM with an RBF kernel to the resulting (mean-pooled vector, label) pairs (a sketch of this pipeline follows the list)
- They then lower the bias (the decision threshold) of the SVM to boost recall at the expense of precision. A possible snippet for doing this:

    import numpy as np

    def new_predict(clf, new_bias, embedding):
        # Lowering new_bias flags more images as human (recall over precision)
        return np.sign(clf.decision_function(embedding) - new_bias)
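Here is a minimal sketch of that pipeline, assuming OpenAI's clip package, Pillow, and scikit-learn. The helper names (three_crops, embed_image, fit_human_classifier) are mine, not the paper's, and the exact crop arithmetic is my guess at what "endpoints and middle" means.

    import clip                      # OpenAI's CLIP package (github.com/openai/CLIP)
    import numpy as np
    import torch
    from PIL import Image
    from sklearn.svm import SVC

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/16", device=device)

    def three_crops(img: Image.Image, size: int = 224) -> list:
        # Resize so the smallest edge is `size`, then take square crops at the
        # start, middle, and end of the longer side.
        w, h = img.size
        scale = size / min(w, h)
        w, h = round(w * scale), round(h * scale)
        img = img.resize((w, h))
        if w >= h:
            boxes = [(x, 0, x + size, size) for x in (0, (w - size) // 2, w - size)]
        else:
            boxes = [(0, y, size, y + size) for y in (0, (h - size) // 2, h - size)]
        return [img.crop(box) for box in boxes]

    @torch.no_grad()
    def embed_image(img: Image.Image) -> np.ndarray:
        # Embed each crop with CLIP ViT-B/16, then mean-pool into a single vector.
        batch = torch.stack([preprocess(c) for c in three_crops(img)]).to(device)
        return model.encode_image(batch).mean(dim=0).cpu().numpy()   # shape (512,)

    def fit_human_classifier(images: list, labels: list) -> SVC:
        # RBF-kernel SVM on (mean-pooled embedding, human/not-human label) pairs.
        X = np.stack([embed_image(img) for img in images])
        return SVC(kernel="rbf").fit(X, np.asarray(labels))

The images and labels here would be the few thousand hand-labeled random samples from the first step.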
The resulting SVM had a false negative rate of less than 1%.
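The paper doesn't say exactly how the bias was chosen. One straightforward way, and this is my assumption rather than their procedure, is to pick the threshold from the decision scores of known human images on a held-out labeled set:

    import numpy as np

    def tune_bias(clf, X_val, y_val, target_fnr=0.01):
        # X_val: mean-pooled embeddings; y_val: 1 for human, 0 for not-human.
        human_scores = clf.decision_function(X_val[np.asarray(y_val) == 1])
        # Choose a threshold that only ~target_fnr of human images score below,
        # i.e. only that fraction would slip past the filter.
        return np.quantile(human_scores, target_fnr)

Passing the result to new_predict as new_bias then gives roughly the target false negative rate on the validation set.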
Using a ViT-B/32 CLIP feature extractor gave worse recall for humans in low-light or obstructed settings. The ViT-B/16 backbone had no such problems. The two backbones have the same embedding width; the difference is the patch size: smaller patches -> more patches -> more input tokens -> finer spatial resolution in the hidden states.
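Concretely, for a 224x224 crop the patch counts work out to a 4x difference:

    # Number of image patches (input tokens) per 224x224 crop
    patches_b16 = (224 // 16) ** 2   # 196 tokens for ViT-B/16
    patches_b32 = (224 // 32) ** 2   # 49 tokens for ViT-B/32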
Filtering violent images from the dataset
Filtering violent images from the dataset was a similar process:
- use CLIP to search the image dataset for violence-related words, then label a few hundred positives and negatives (see the sketch after this list)
- train an SVM as in the human-filtering recipe above
- label samples near the decision boundary of this SVM to obtain a few hundred more labels
- iterate a few more times
- tune the bias of the final SVM until the false negative rate is below 1% (again prioritizing recall over precision), using the same kind of threshold tuning sketched earlier
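A rough sketch of that loop, reusing embed_image and tune_bias from the sketches above; the helper names and the use of a query list are mine, not the paper's:

    import clip
    import numpy as np
    import torch
    from sklearn.svm import SVC

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/16", device=device)

    @torch.no_grad()
    def rank_by_text_queries(image_embeds: np.ndarray, queries: list) -> np.ndarray:
        # Rank images by cosine similarity to violence-related text queries,
        # most similar first; image_embeds are the mean-pooled CLIP vectors.
        text = model.encode_text(clip.tokenize(queries).to(device)).cpu().numpy()
        img = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
        txt = text / np.linalg.norm(text, axis=1, keepdims=True)
        return np.argsort(-(img @ txt.T).max(axis=1))

    def boundary_indices(clf: SVC, X: np.ndarray, n: int = 300) -> np.ndarray:
        # Unlabeled samples closest to the current SVM's decision boundary,
        # i.e. the ones worth sending back for another round of manual labeling.
        return np.argsort(np.abs(clf.decision_function(X)))[:n]

Each round then looks like: label the top-ranked or boundary samples, refit SVC(kernel="rbf") on everything labeled so far, and finish by tuning the bias until the false negative rate on held-out labeled violent images drops below 1%.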