🤗 ERNIE-Image |
🤗 ERNIE-Image-Turbo |
🤖 ERNIE-Image |
🤖 ERNIE-Image-Turbo
🖥️ Hugging Face Demo 1 |
🖥️ Hugging Face Demo 2 (ZeroGPU) |
🖥️ AI Studio Demo
GitHub |
📖 Blog |
🖼️ Gallery
💬 WeChat |
🫨 Discord |
🏷️ X
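ERNIE-Image is an open-source text-to-image generation model developed by the ERNIE-Image team at Baidu. It is built on a single-stream Diffusion Transformer (DiT) and is paired with a lightweight prompt enhancer that expands brief user inputs into richer, structured descriptions. With only 8B DiT parameters, it achieves state-of-the-art performance among open-source text-to-image models. The model is designed not only for strong visual quality but also for controllability in practical generation scenarios, where rendering content accurately matters as much as aesthetics. In particular, ERNIE-Image performs well at complex instruction following, text rendering, and structured image generation, which makes it well suited to commercial posters, comics, multi-panel layouts, and other content-creation tasks that require both visual quality and precise control. It also supports a wide range of visual styles, including realistic photography, design-oriented imagery, and more stylized aesthetic outputs.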
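Highlights:
ERNIE-Image: our SFT model, which offers stronger general-purpose capability and instruction fidelity, typically with 50 inference steps.
ERNIE-Image-Turbo: our Turbo model, optimized with DMD and RL, which delivers faster generation and higher aesthetic quality in only 8 inference steps.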
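The practical difference between the two variants is the number of inference steps. The sketch below records that choice in one place so calling code does not hard-code it. The names `VariantSettings`, `VARIANTS`, and `settings_for` are hypothetical; only the model names and the step counts (50 and 8) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSettings:
    """Per-variant defaults (hypothetical container, not part of any library)."""
    name: str
    num_inference_steps: int


# Step counts taken from the highlights above.
VARIANTS = {
    "base": VariantSettings("ERNIE-Image", 50),
    "turbo": VariantSettings("ERNIE-Image-Turbo", 8),
}


def settings_for(variant: str) -> VariantSettings:
    """Return the defaults for 'base' or 'turbo'; reject anything else."""
    try:
        return VARIANTS[variant]
    except KeyError:
        raise ValueError(f"unknown variant: {variant!r}") from None
```

For example, `settings_for("turbo").num_inference_steps` could be passed as `num_inference_steps` when calling the pipeline.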

| Model | Single Object | Two Objects | Counting | Colors | Position | Attribute Binding | Overall |
|---|---|---|---|---|---|---|---|
| ERNIE-Image (w/o PE) | 1.0000 | 0.9596 | 0.7781 | 0.9282 | 0.8550 | 0.7925 | 0.8856 |
| ERNIE-Image (w/ PE) | 0.9906 | 0.9596 | 0.8187 | 0.8830 | 0.8625 | 0.7225 | 0.8728 |
| Qwen-Image | 0.9900 | 0.9200 | 0.8900 | 0.8800 | 0.7600 | 0.7700 | 0.8683 |
| ERNIE-Image-Turbo (w/o PE) | 1.0000 | 0.9621 | 0.7906 | 0.9202 | 0.7975 | 0.7300 | 0.8667 |
| ERNIE-Image-Turbo (w/ PE) | 0.9938 | 0.9419 | 0.8375 | 0.8351 | 0.7950 | 0.7025 | 0.8510 |
| FLUX.2-klein-9B | 0.9313 | 0.9571 | 0.8281 | 0.9149 | 0.7175 | 0.7400 | 0.8481 |
| Z-Image | 1.0000 | 0.9400 | 0.7800 | 0.9300 | 0.6200 | 0.7700 | 0.8400 |
| Z-Image-Turbo | 1.0000 | 0.9500 | 0.7700 | 0.8900 | 0.6500 | 0.6800 | 0.8233 |
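The Overall column in the table above appears to be the unweighted mean of the six sub-scores. The following sketch recomputes it for three rows copied from the table. The averaging rule is an inference from the numbers, not something the source states, and `overall` is a hypothetical helper name.

```python
from statistics import mean

# (model, six sub-scores in column order, reported Overall), copied from the table.
ROWS = [
    ("ERNIE-Image (w/o PE)", [1.0000, 0.9596, 0.7781, 0.9282, 0.8550, 0.7925], 0.8856),
    ("ERNIE-Image (w/ PE)", [0.9906, 0.9596, 0.8187, 0.8830, 0.8625, 0.7225], 0.8728),
    ("Qwen-Image", [0.9900, 0.9200, 0.8900, 0.8800, 0.7600, 0.7700], 0.8683),
]


def overall(scores):
    """Unweighted mean of the sub-scores (assumed aggregation rule)."""
    return mean(scores)


for model, scores, reported in ROWS:
    computed = overall(scores)
    print(f"{model}: computed {computed:.4f}, reported {reported:.4f}")
```

Under this assumption, a gain in one category (for example Counting with the prompt enhancer) can be offset by losses in others (Colors, Attribute Binding), which is why the w/ PE rows score lower overall here.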

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| Nano Banana 2.0 | 0.8880 | 0.9440 | 0.3340 | 0.4810 | 0.2450 | 0.5780 |
| Seedream 4.5 | 0.8910 | 0.9980 | 0.3500 | 0.4340 | 0.2070 | 0.5760 |
| ERNIE-Image (w/ PE) | 0.8678 | 0.9788 | 0.3566 | 0.4309 | 0.2411 | 0.5750 |
| Seedream 4.0 | 0.8920 | 0.9830 | 0.3470 | 0.4530 | 0.1910 | 0.5730 |
| ERNIE-Image-Turbo (w/ PE) | 0.8676 | 0.9666 | 0.3537 | 0.4191 | 0.2212 | 0.5656 |
| ERNIE-Image (w/o PE) | 0.8909 | 0.9668 | 0.2950 | 0.4471 | 0.1687 | 0.5537 |
| Z-Image | 0.8810 | 0.9870 | 0.2800 | 0.3870 | 0.1940 | 0.5460 |
| Qwen-Image | 0.8820 | 0.8910 | 0.3060 | 0.4180 | 0.1970 | 0.5390 |
| ERNIE-Image-Turbo (w/o PE) | 0.8795 | 0.9488 | 0.2913 | 0.4277 | 0.1232 | 0.5341 |
| FLUX.2-klein-9B | 0.8871 | 0.8657 | 0.3117 | 0.4417 | 0.1560 | 0.5324 |
| Qwen-Image-2512 | 0.8760 | 0.9900 | 0.2920 | 0.3380 | 0.1510 | 0.5300 |
| GLM-Image | 0.8050 | 0.9690 | 0.2980 | 0.3530 | 0.2130 | 0.5280 |
| Z-Image-Turbo | 0.8400 | 0.9940 | 0.2980 | 0.3680 | 0.1390 | 0.5280 |
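To see what the prompt enhancer (PE) changes in the table above, it helps to subtract the ERNIE-Image w/o PE row from the w/ PE row. The sketch below does this with values copied from the table. The English dimension labels and the helper name `pe_deltas` are illustrative choices, not part of the source.

```python
DIMENSIONS = ["Alignment", "Text", "Reasoning", "Style", "Diversity", "Overall"]

# ERNIE-Image rows copied from the table above, in column order.
WITH_PE = [0.8678, 0.9788, 0.3566, 0.4309, 0.2411, 0.5750]
WITHOUT_PE = [0.8909, 0.9668, 0.2950, 0.4471, 0.1687, 0.5537]


def pe_deltas(with_pe, without_pe):
    """Per-dimension score change when the prompt enhancer is enabled."""
    return {
        dim: round(w - wo, 4)
        for dim, w, wo in zip(DIMENSIONS, with_pe, without_pe)
    }


deltas = pe_deltas(WITH_PE, WITHOUT_PE)
for dim, delta in deltas.items():
    print(f"{dim:<10} {delta:+.4f}")
```

On these rows the enhancer raises Reasoning and Diversity the most and slightly lowers Alignment and Style, which suggests enabling it when richer, more varied output matters more than literal adherence to a short prompt.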

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| Nano Banana 2.0 | 0.8430 | 0.9830 | 0.3110 | 0.4610 | 0.2360 | 0.5670 |
| ERNIE-Image (w/ PE) | 0.8299 | 0.9539 | 0.3056 | 0.4342 | 0.2478 | 0.5543 |
| Seedream 4.0 | 0.8360 | 0.9860 | 0.3040 | 0.4430 | 0.2000 | 0.5540 |
| Seedream 4.5 | 0.8320 | 0.9860 | 0.3000 | 0.4260 | 0.2130 | 0.5510 |
| Qwen-Image | 0.8250 | 0.9630 | 0.2670 | 0.4050 | 0.2790 | 0.5480 |
| ERNIE-Image-Turbo (w/ PE) | 0.8258 | 0.9386 | 0.3043 | 0.4208 | 0.2281 | 0.5435 |
| Z-Image | 0.7930 | 0.9880 | 0.2660 | 0.3860 | 0.2430 | 0.5350 |
| ERNIE-Image (w/o PE) | 0.8421 | 0.8979 | 0.2656 | 0.4212 | 0.1772 | 0.5208 |
| Qwen-Image-2512 | 0.8230 | 0.9830 | 0.2720 | 0.3420 | 0.1570 | 0.5150 |
| GLM-Image | 0.7380 | 0.9760 | 0.2840 | 0.3350 | 0.2210 | 0.5110 |
| Z-Image-Turbo | 0.7820 | 0.9820 | 0.2760 | 0.3610 | 0.1340 | 0.5070 |
| ERNIE-Image-Turbo (w/o PE) | 0.8326 | 0.9086 | 0.2580 | 0.4002 | 0.1316 | 0.5062 |
| FLUX.2-klein-9B | 0.8201 | 0.4920 | 0.2599 | 0.4166 | 0.1625 | 0.4302 |

| Model | LongText-Bench-EN | LongText-Bench-ZH | Average |
|---|---|---|---|
| Seedream 4.5 | 0.9890 | 0.9873 | 0.9882 |
| ERNIE-Image (w/ PE) | 0.9804 | 0.9661 | 0.9733 |
| GLM-Image | 0.9524 | 0.9788 | 0.9656 |
| ERNIE-Image-Turbo (w/ PE) | 0.9675 | 0.9636 | 0.9655 |
| Nano Banana 2.0 | 0.9808 | 0.9491 | 0.9650 |
| ERNIE-Image-Turbo (w/o PE) | 0.9602 | 0.9675 | 0.9639 |
| ERNIE-Image (w/o PE) | 0.9679 | 0.9594 | 0.9636 |
| Qwen-Image-2512 | 0.9561 | 0.9647 | 0.9604 |
| Qwen-Image | 0.9430 | 0.9460 | 0.9445 |
| Z-Image | 0.9350 | 0.9360 | 0.9355 |
| Seedream 4.0 | 0.9214 | 0.9261 | 0.9238 |
| Z-Image-Turbo | 0.9170 | 0.9260 | 0.9215 |
| FLUX.2-klein-9B | 0.8642 | 0.2183 | 0.5413 |

```bash
pip install git+https://github.com/huggingface/diffusers
```
```python
import torch
from diffusers import ErnieImagePipeline

# Load the pipeline in bfloat16 and move it to the GPU
pipe = ErnieImagePipeline.from_pretrained(
    "Baidu/ERNIE-Image",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True,  # use prompt enhancer
).images[0]

# Write the generated image to disk
image.save("output.png")
```

Install the latest version of sglang:
```bash
git clone https://github.com/sgl-project/sglang.git
```

Start the server:
```bash
sglang serve --model-path baidu/ERNIE-Image
```

Send a generation request:
```bash
curl -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 50,
    "guidance_scale": 4.0,
    "use_pe": true
  }' \
  --output output.png
```
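
The same request can be sent from Python. The sketch below mirrors the curl command using only the standard library: it posts the same JSON fields to the same URL and writes the raw response body to a file, as `--output` does. The helper names `build_request` and `generate` are hypothetical, and the response format is not specified in this document, so the code makes no attempt to parse it.

```python
import json
import urllib.request

DEFAULT_URL = "http://localhost:30000/v1/images/generations"


def build_request(prompt, url=DEFAULT_URL, *, height=1264, width=848,
                  num_inference_steps=50, guidance_scale=4.0, use_pe=True):
    """Build the POST request with the same JSON fields as the curl example."""
    payload = {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        "use_pe": use_pe,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, output_path="output.png", **kwargs):
    """Send the request and save the raw response body, like curl --output."""
    request = build_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response, open(output_path, "wb") as f:
        f.write(response.read())
```

With the server from the previous step running, `generate("a cyclist on a covered street at dusk")` would write the response to `output.png`. Keeping `build_request` separate makes the payload easy to inspect without a running server.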