When using image generation AI, you often encounter the term 'CLIP'. While many perceive it simply as a 'converter that transforms text into images', the role of CLIP is much deeper and more sophisticated. Especially with the emergence of the 'Dual CLIP' approach that utilizes two or more encoders to generate high-quality images, the strategies for crafting prompts are also evolving.
In this article, we will explore the basic concept of CLIP and how to optimize prompts in a dual encoder system.
1. What is CLIP?
CLIP (Contrastive Language-Image Pre-training) is a multi-modal model developed by OpenAI that, as its name implies, learns by contrasting 'language (text)' and 'images'.
Its core function is to place images and text in the same 'latent space', a shared high-dimensional vector space.
- If the text "dog" is located at a specific coordinate (vector) in that space,
- the image of a "dog" is also trained to land near that same coordinate.
Role in image generation models:
The important point is that CLIP itself is not a generator that creates images. Instead, it acts as a 'judge' or 'navigator' that evaluates how well the image currently being generated by a generative model (e.g., Stable Diffusion) aligns with the user's text prompt. The higher the CLIP score (i.e., the closer the text and image are in the latent space), the more confident the generative model is that "this direction is correct," and it refines the details accordingly.
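As a minimal sketch of this 'judge' role (assuming the Hugging Face transformers library with the public openai/clip-vit-large-patch14 checkpoint; the image file name is only a placeholder), the following code embeds an image and two candidate texts into the shared latent space and reports how closely each text matches the image:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the public CLIP ViT-L/14 checkpoint (the same family as clip_l).
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("generated_sample.png")  # placeholder path to any image
texts = ["a photo of a dog", "a photo of a cat"]

# Project the image and the texts into the shared latent space.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-text similarity; a higher value means the
# text and the image sit closer together in the latent space.
scores = outputs.logits_per_image.softmax(dim=-1)[0]
print(dict(zip(texts, scores.tolist())))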
2. The Rise of Dual Encoders: Why Use Two Models?
Recently, high-performance models have moved beyond a single CLIP text encoder to a 'dual text encoder' approach. Stable Diffusion XL (SDXL) pairs two CLIP encoders, while newer models such as Stable Diffusion 3 and FLUX combine a CLIP encoder with a large language model encoder; the representative combination discussed here is clip_l and t5xxl_fp16.
This is due to 'division of roles and specialization'.
- clip_l (Visual/Keyword Matching Expert):
  - This is the traditional CLIP text encoder.
  - It is strong at connecting visual concepts between images and text.
  - It is generally used to capture core keywords, styles, compositions, and other visual elements.
- t5xxl_fp16 (Language/Context Expert):
  - T5 is a large language model (LLM) developed by Google, much larger than CLIP's basic text encoder.
  - It understands complex sentence structures, context, and subtle relationships between words far more deeply than simple keywords.
  - As a 'language expert', it grasps the detailed nuances of prompts.
By using the two encoders together, it becomes possible to generate images that accurately reflect both the meaning of complex, lengthy sentences (T5) and the core visual styles (CLIP-L).
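To make this division of roles concrete, here is a rough sketch (assuming the transformers library; google/t5-v1_1-base is used as a small stand-in for the far larger t5xxl_fp16) that encodes the same prompt with both text encoders and prints the shapes of the embeddings a diffusion model would consume side by side:

import torch
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

prompt = "A young woman with a black ponytail sitting on a white carpet"

# clip_l: the traditional CLIP text encoder (visual/keyword matching expert).
clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Small T5 stand-in for t5xxl_fp16 (language/context expert).
t5_tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-base")
t5_encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-base")

with torch.no_grad():
    clip_emb = clip_encoder(**clip_tokenizer(prompt, return_tensors="pt")).last_hidden_state
    t5_emb = t5_encoder(**t5_tokenizer(prompt, return_tensors="pt")).last_hidden_state

# CLIP is capped at 77 tokens and yields a compact visual-concept embedding;
# T5 accepts much longer input and yields a richer sentence-level embedding.
print("clip_l:", clip_emb.shape, "| t5:", t5_emb.shape)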
3. Optimal Prompt Writing: Combinations of Sentences and Keywords
To maximize the performance of a dual encoder system, it is advisable to provide each encoder with a prompt that matches its characteristics. Many advanced image generation tools (e.g., ComfyUI) let you enter different text for each of the two encoders; a code sketch of the same idea follows the examples below.
- T5-XXL (Language Expert) → Natural 'Sentences'
  - It's better to describe the scene in complete sentence form so that the model can grasp the context.
  - Example: "A Japanese young woman in her 20s, with a black ponytail and wearing black framed glasses. She is sitting on a white carpet, listening to music, and staring up at the camera."
- CLIP-L (Visual Matching Expert) → 'Core Keywords'
  - It's effective to list the visually important core elements such as style, objects, colors, and composition.
  - Example: "japanese woman, 20s, black ponytail, black glasses, white headphones, white carpet, listening music, sitting, staring up, from above, whole body, professional real photo"
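Outside ComfyUI, the same split can be expressed in code. The sketch below assumes the diffusers library's StableDiffusion3Pipeline, where, as I understand its interface, prompt feeds the CLIP encoders and prompt_3 feeds the T5 encoder; treat the parameter names and the model ID as assumptions to verify against your installed version:

import torch
from diffusers import StableDiffusion3Pipeline

# Assumed model ID; any SD3-family checkpoint that includes a T5 encoder works the same way.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

# Core keywords for the CLIP encoders.
clip_keywords = (
    "japanese woman, 20s, black ponytail, black glasses, white headphones, "
    "white carpet, listening music, sitting, staring up, from above, "
    "whole body, professional real photo"
)

# Natural sentence for the T5 encoder.
t5_sentence = (
    "A Japanese young woman in her 20s, with a black ponytail and wearing "
    "black framed glasses. She is sitting on a white carpet, listening to "
    "music, and staring up at the camera."
)

image = pipe(prompt=clip_keywords, prompt_3=t5_sentence).images[0]
image.save("dual_prompt_result.png")  # placeholder output file name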
4. What About Prompts in JSON or Dictionary Form?
To manage prompts systematically, they are sometimes written in JSON or Python dictionary form, as follows:
{
  "Character": "japanese young woman in her 20's.",
  "Appearance": {
    "hair": "black ponytail hair",
    "wearing": "black framed glasses, white wireless headphone"
  },
  "Action": "listening to music, sitting on white carpet",
  "style": "professional real photo"
}
Simply copying this structure and inputting it identically into both the T5 and CLIP-L encoders is very inefficient and undesirable.
Why Does Effectiveness Decrease?
- Mismatch with the models' training data: Models like T5 and CLIP were trained on natural sentences. Programming syntax symbols such as {, }, and " may be recognized as 'noise' rather than 'language' by the models (see the token check below).
- Contextual disconnection: The organic contextual connection between "Character": "..." and "Action": "..." is severed. The model perceives the character and the action as separate pieces of information and struggles to combine them into a single natural scene.
Correct Transformation Examples
To make proper use of the above JSON data in a dual encoder system, it needs to be 'translated' as follows:
- T5-XXL (Sentence Input): "A japanese young woman in her 20's, with a black ponytail, wearing black framed glasses and white wireless headphones. She is listening to music while sitting on a white carpet. This is a professional real photo."
- CLIP-L (Keyword Input): "japanese young woman, 20's, black ponytail, black framed glasses, white wireless headphone, listening music, sitting on white carpet, professional real photo"
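If you still want to keep prompts in dictionary form for bookkeeping, the 'translation' step can be partly automated. The helper below is a hypothetical sketch (the function name and the sentence template are my own, not from any library): it flattens the dictionary into a keyword string for CLIP-L and stitches together a readable sentence for T5-XXL:

# Structured prompt mirroring the article's JSON example.
prompt_data = {
    "Character": "japanese young woman in her 20's",
    "Appearance": {
        "hair": "black ponytail hair",
        "wearing": "black framed glasses, white wireless headphone",
    },
    "Action": "listening to music, sitting on white carpet",
    "style": "professional real photo",
}

def flatten(values):
    """Yield every leaf string from a (possibly nested) dictionary."""
    for value in values.values():
        if isinstance(value, dict):
            yield from flatten(value)
        else:
            yield value

# CLIP-L input: comma-separated core keywords.
clip_l_prompt = ", ".join(flatten(prompt_data))

# T5-XXL input: one readable description stitched from the same fields.
t5_prompt = (
    f"A {prompt_data['Character']}, with {prompt_data['Appearance']['hair']}, "
    f"wearing {prompt_data['Appearance']['wearing']}. "
    f"She is {prompt_data['Action']}. This is a {prompt_data['style']}."
)

print(clip_l_prompt)
print(t5_prompt)

In practice the sentence is usually better phrased by hand (or by an LLM), since mechanical joining cannot guarantee natural grammar; the point is only that both encoders receive natural language rather than raw JSON.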

5. Summary and Conclusion
- CLIP is not a generator that creates images, but a 'judge' that evaluates the degree of matching between text and images.
- The dual encoder (T5 + CLIP-L) is a powerful method where the 'language expert (T5)' and the 'visual matching expert (CLIP-L)' collaborate.
- To achieve the best results, it's best to provide natural sentences to T5 and core keywords to CLIP-L.
- Using structured data like JSON or dictionaries directly in prompts hinders the model's understanding, so it should be converted into natural language sentences and keywords.