Fine-tuning is the practice of taking a model which has been trained on a wide and diverse dataset, and then training it a bit more on the dataset you are specifically interested in. This is common practice in deep learning and has been shown to be tremendously effective for all manner of models, from standard image classification networks to GANs. In this example we'll show how to fine-tune Stable Diffusion on a Naruto anime character dataset to create a text-to-image model which makes custom Naruto-inspired images based on any text prompt.
If you want more details on how to train your own Stable Diffusion variants, see Justin Pinkney's example of how we made the text-to-pokemon model at Lambda.
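To give a flavour of what the fine-tuning itself involves, here is a minimal sketch of the standard latent-diffusion training loop used by the diffusers text-to-image fine-tuning examples. The base checkpoint name, learning rate, and the `dataloader` of (image tensor, tokenized caption) batches are illustrative assumptions, not the exact setup used for this model:

import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler

# Start from pretrained weights; only the UNet denoiser is optimized in this sketch.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
vae, unet, text_encoder = pipe.vae, pipe.unet, pipe.text_encoder
noise_scheduler = DDPMScheduler(beta_start=0.00085, beta_end=0.012,
                                beta_schedule="scaled_linear", num_train_timesteps=1000)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

for images, input_ids in dataloader:  # hypothetical: batches of (pixel tensors, tokenized captions)
    # Encode images into the VAE latent space (0.18215 is SD's latent scaling factor)
    latents = vae.encode(images).latent_dist.sample() * 0.18215
    # Add noise at a random timestep; the UNet learns to predict that noise
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, 1000, (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    encoder_hidden_states = text_encoder(input_ids)[0]
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()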
If you're just after the model, code, or dataset, see:
If you just want to generate your own Naruto-like images, try the live text-to-naruto demo here!
Here are some examples of the sort of outputs the trained model can produce, along with the prompts used:
Put in a text prompt and generate your own Naruto-style image!
We find that prompt engineering helps produce compelling and consistent Naruto-style portraits. For example, writing prompts such as 'person_name ninja portrait' or 'person_name in the style of Naruto' tends to produce results that are closer to the style of a Naruto character, with the characteristic headband and other costume elements.
Here are a few examples of prompts with and without prompt engineering that illustrate the point.
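If you want to reproduce the comparison yourself, the following sketch shows one way to do it, assuming `pipe` has been loaded as in the "run locally" section below. The subject name is a made-up example, not one of the prompts shown above:

with autocast("cuda"):
    # Same subject, with and without the style keywords
    plain = pipe(4 * ["Hermione Granger"], guidance_scale=10).images
    styled = pipe(4 * ["Hermione Granger ninja portrait in the style of Naruto"], guidance_scale=10).images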
To run the model locally:
!pip install diffusers==0.3.0
!pip install transformers scipy ftfy
import torch
from diffusers import StableDiffusionPipeline
from torch import autocast
pipe = StableDiffusionPipeline.from_pretrained("lambdalabs/sd-naruto-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "Yoda"
scale = 10
n_samples = 4
# Sometimes the nsfw checker is confused by the Naruto images, you can disable
# it at your own risk here
disable_safety = False
if disable_safety:
    def null_safety(images, **kwargs):
        return images, False
    pipe.safety_checker = null_safety

with autocast("cuda"):
    images = pipe(n_samples * [prompt], guidance_scale=scale).images

for idx, im in enumerate(images):
    im.save(f"{idx:06}.png")
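If you want to eyeball all the samples at once, a small helper can tile them into a single contact sheet. This is a common pattern from the diffusers examples rather than part of this model's code:

from PIL import Image

def image_grid(imgs, rows, cols):
    # Paste the PIL images into one rows x cols contact sheet
    w, h = imgs[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, ((i % cols) * w, (i // cols) * h))
    return grid

image_grid(images, rows=1, cols=n_samples).save("grid.png")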
Trained on BLIP-captioned Naruto images using 2xA6000 GPUs on Lambda GPU Cloud for around 30,000 steps (about 12 hours, at a cost of about $20).
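For reference, captioning a dataset with BLIP can be done along these lines. This is a sketch using the transformers port of BLIP with a hypothetical filename, not necessarily the exact captioning script used for this dataset:

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("naruto_frame.png").convert("RGB")  # hypothetical example image
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)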
Trained by Eole Cervenka (@eoluscvka), following the work of Justin Pinkney (@Buntworthy).