ERNIE-Image

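ERNIE-Image is an open-source text-to-image generation model developed by the ERNIE-Image team at Baidu. It is built on a single-stream Diffusion Transformer (DiT) and paired with a lightweight prompt enhancer (PE) that expands brief user inputs into richer, structured descriptions. With only 8B DiT parameters, it achieves state-of-the-art performance among open-source text-to-image models. The model is designed not only for strong visual quality but also for controllability in practical generation scenarios, where accurate content rendering matters as much as aesthetics. In particular, ERNIE-Image performs well on complex instruction following, text rendering, and structured image generation, which makes it a good fit for commercial posters, comics, multi-panel layouts, and other content creation tasks that need both visual quality and precise control. It also supports a broad range of visual styles, including realistic photography, design-oriented imagery, and more stylized aesthetic outputs.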

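Highlights:

Compact but strong: despite having only 8B parameters, ERNIE-Image remains highly competitive with much larger open-source models across a range of benchmarks.
Text rendering: ERNIE-Image performs particularly well on dense, long-form, and layout-sensitive text, making it a strong choice for posters, infographics, UI-like images, and other text-heavy visual content.
Instruction following: the model reliably follows complex prompts involving multiple objects, detailed relationships, and knowledge-intensive descriptions.
Structured generation: ERNIE-Image is especially well suited to structured visual tasks such as posters, comics, storyboards, and multi-panel compositions, where layout and organization matter.
Style coverage: beyond clean, readable design-oriented outputs, the model also supports realistic photography and distinctive stylized aesthetics, including softer, more cinematic visual tones.
Practical deployment: thanks to its compact size, ERNIE-Image can run on consumer GPUs with 24G of VRAM, lowering the barrier for research, downstream use, and model adaptation.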
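
As a rough sanity check on the 24G figure, the back-of-the-envelope sketch below estimates the memory taken by the DiT weights alone. It assumes bfloat16 storage (2 bytes per parameter, the dtype used in the Diffusers example) and ignores the text encoder, prompt enhancer, VAE, and activations, whose sizes are not stated here. The function name is illustrative.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the memory needed to hold the weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# 8B DiT parameters stored in bfloat16 (assumption: 2 bytes per parameter).
dit_gib = weight_memory_gib(8e9)
print(f"DiT weights: ~{dit_gib:.1f} GiB")
```

Under these assumptions the DiT weights take roughly 14.9 GiB, which leaves headroom on a 24G card for the remaining components.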

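Released Versions

ERNIE-Image: our SFT model, which delivers stronger general-purpose capability and instruction fidelity, typically in 50 inference steps.

ERNIE-Image-Turbo: our Turbo model, optimized with DMD and RL, which delivers faster generation and stronger aesthetics in only 8 inference steps.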
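
One way to keep the two step counts straight in application code is a small lookup table. The sketch below is illustrative only: the class and function names are hypothetical, and only the step counts (50 and 8) come from the release notes above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantDefaults:
    name: str
    num_inference_steps: int


# Step counts are taken from the release notes; the structure is hypothetical.
VARIANTS = {
    "ERNIE-Image": VariantDefaults("ERNIE-Image", 50),
    "ERNIE-Image-Turbo": VariantDefaults("ERNIE-Image-Turbo", 8),
}


def steps_for(variant: str) -> int:
    """Look up the typical number of inference steps for a released variant."""
    try:
        return VARIANTS[variant].num_inference_steps
    except KeyError:
        raise ValueError(f"unknown variant: {variant!r}") from None


# Ratio of denoising steps between the two variants.
step_ratio = steps_for("ERNIE-Image") / steps_for("ERNIE-Image-Turbo")
print(f"Turbo runs {step_ratio:.2f}x fewer denoising steps")
```

The step ratio is 6.25; actual wall-clock speedup depends on factors not covered here, so it should be measured rather than inferred from this number.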

Benchmarks

GENEval

Model	Single Object	Two Objects	Counting	Colors	Position	Attribute Binding	Overall
ERNIE-Image (w/o PE)	1.0000	0.9596	0.7781	0.9282	0.8550	0.7925	0.8856
ERNIE-Image (w/ PE)	0.9906	0.9596	0.8187	0.8830	0.8625	0.7225	0.8728
Qwen-Image	0.9900	0.9200	0.8900	0.8800	0.7600	0.7700	0.8683
ERNIE-Image-Turbo (w/o PE)	1.0000	0.9621	0.7906	0.9202	0.7975	0.7300	0.8667
ERNIE-Image-Turbo (w/ PE)	0.9938	0.9419	0.8375	0.8351	0.7950	0.7025	0.8510
FLUX.2-klein-9B	0.9313	0.9571	0.8281	0.9149	0.7175	0.7400	0.8481
Z-Image	1.0000	0.9400	0.7800	0.9300	0.6200	0.7700	0.8400
Z-Image-Turbo	1.0000	0.9500	0.7700	0.8900	0.6500	0.6800	0.8233
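
The overall GENEval score in this table appears to be the unweighted mean of the six sub-scores. That is an observation from the numbers, not something stated in the source. The sketch below checks it for a few rows copied from the table, allowing for rounding in the published four-decimal values.

```python
import math

# Per row: (single object, two objects, counting, colors, position,
# attribute binding), followed by the reported overall score.
GENEVAL = {
    "ERNIE-Image (w/o PE)": ((1.0000, 0.9596, 0.7781, 0.9282, 0.8550, 0.7925), 0.8856),
    "ERNIE-Image (w/ PE)": ((0.9906, 0.9596, 0.8187, 0.8830, 0.8625, 0.7225), 0.8728),
    "Qwen-Image": ((0.9900, 0.9200, 0.8900, 0.8800, 0.7600, 0.7700), 0.8683),
    "Z-Image": ((1.0000, 0.9400, 0.7800, 0.9300, 0.6200, 0.7700), 0.8400),
}


def mean(scores):
    """Unweighted arithmetic mean of a sequence of scores."""
    return sum(scores) / len(scores)


for model, (scores, overall) in GENEVAL.items():
    # Tolerance covers rounding of the published values.
    assert math.isclose(mean(scores), overall, abs_tol=1e-4), model
    print(f"{model}: mean={mean(scores):.4f} reported={overall:.4f}")
```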

OneIG-EN

Model	Alignment	Text	Reasoning	Style	Diversity	Overall
Nano Banana 2.0	0.8880	0.9440	0.3340	0.4810	0.2450	0.5780
Seedream 4.5	0.8910	0.9980	0.3500	0.4340	0.2070	0.5760
ERNIE-Image (w/ PE)	0.8678	0.9788	0.3566	0.4309	0.2411	0.5750
Seedream 4.0	0.8920	0.9830	0.3470	0.4530	0.1910	0.5730
ERNIE-Image-Turbo (w/ PE)	0.8676	0.9666	0.3537	0.4191	0.2212	0.5656
ERNIE-Image (w/o PE)	0.8909	0.9668	0.2950	0.4471	0.1687	0.5537
Z-Image	0.8810	0.9870	0.2800	0.3870	0.1940	0.5460
Qwen-Image	0.8820	0.8910	0.3060	0.4180	0.1970	0.5390
ERNIE-Image-Turbo (w/o PE)	0.8795	0.9488	0.2913	0.4277	0.1232	0.5341
FLUX.2-klein-9B	0.8871	0.8657	0.3117	0.4417	0.1560	0.5324
Qwen-Image-2512	0.8760	0.9900	0.2920	0.3380	0.1510	0.5300
GLM-Image	0.8050	0.9690	0.2980	0.3530	0.2130	0.5280
Z-Image-Turbo	0.8400	0.9940	0.2980	0.3680	0.1390	0.5280

OneIG-ZH

Model	Alignment	Text	Reasoning	Style	Diversity	Overall
Nano Banana 2.0	0.8430	0.9830	0.3110	0.4610	0.2360	0.5670
ERNIE-Image (w/ PE)	0.8299	0.9539	0.3056	0.4342	0.2478	0.5543
Seedream 4.0	0.8360	0.9860	0.3040	0.4430	0.2000	0.5540
Seedream 4.5	0.8320	0.9860	0.3000	0.4260	0.2130	0.5510
Qwen-Image	0.8250	0.9630	0.2670	0.4050	0.2790	0.5480
ERNIE-Image-Turbo (w/ PE)	0.8258	0.9386	0.3043	0.4208	0.2281	0.5435
Z-Image	0.7930	0.9880	0.2660	0.3860	0.2430	0.5350
ERNIE-Image (w/o PE)	0.8421	0.8979	0.2656	0.4212	0.1772	0.5208
Qwen-Image-2512	0.8230	0.9830	0.2720	0.3420	0.1570	0.5150
GLM-Image	0.7380	0.9760	0.2840	0.3350	0.2210	0.5110
Z-Image-Turbo	0.7820	0.9820	0.2760	0.3610	0.1340	0.5070
ERNIE-Image-Turbo (w/o PE)	0.8326	0.9086	0.2580	0.4002	0.1316	0.5062
FLUX.2-klein-9B	0.8201	0.4920	0.2599	0.4166	0.1625	0.4302
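
To see what the prompt enhancer changes, the two ERNIE-Image rows of the OneIG-ZH table can be compared dimension by dimension. The sketch below copies those rows from the table and computes the per-dimension difference; the helper name is illustrative.

```python
DIMENSIONS = ("alignment", "text", "reasoning", "style", "diversity", "overall")

# OneIG-ZH rows for ERNIE-Image, copied from the table above.
WITH_PE = (0.8299, 0.9539, 0.3056, 0.4342, 0.2478, 0.5543)
WITHOUT_PE = (0.8421, 0.8979, 0.2656, 0.4212, 0.1772, 0.5208)


def pe_delta(with_pe, without_pe):
    """Per-dimension change (with PE minus without PE), rounded to 4 decimals."""
    return {
        name: round(a - b, 4)
        for name, a, b in zip(DIMENSIONS, with_pe, without_pe)
    }


for name, delta in pe_delta(WITH_PE, WITHOUT_PE).items():
    print(f"{name:>10}: {delta:+.4f}")
```

On these rows, enabling the enhancer trades a small drop in alignment for gains in text, reasoning, style, and diversity.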

LongTextBench

Model	LongText-Bench-EN	LongText-Bench-ZH	Average
Seedream 4.5	0.9890	0.9873	0.9882
ERNIE-Image (w/ PE)	0.9804	0.9661	0.9733
GLM-Image	0.9524	0.9788	0.9656
ERNIE-Image-Turbo (w/ PE)	0.9675	0.9636	0.9655
Nano Banana 2.0	0.9808	0.9491	0.9650
ERNIE-Image-Turbo (w/o PE)	0.9602	0.9675	0.9639
ERNIE-Image (w/o PE)	0.9679	0.9594	0.9636
Qwen-Image-2512	0.9561	0.9647	0.9604
Qwen-Image	0.9430	0.9460	0.9445
Z-Image	0.9350	0.9360	0.9355
Seedream 4.0	0.9214	0.9261	0.9238
Z-Image-Turbo	0.9170	0.9260	0.9215
FLUX.2-klein-9B	0.8642	0.2183	0.5413

Quick Start

Diffusers

pip install git+https://github.com/huggingface/diffusers

import torch
from diffusers import ErnieImagePipeline

pipe = ErnieImagePipeline.from_pretrained(
    "Baidu/ERNIE-Image",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True,  # use the prompt enhancer
).images[0]

image.save("output.png")

SGLang

Install the latest version of sglang:

git clone https://github.com/sgl-project/sglang.git

Launch the server:

sglang serve --model-path baidu/ERNIE-Image

Send a generation request:

curl -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 50,
    "guidance_scale": 4.0,
    "use_pe": true

  }' \
  --output output.png
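
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl command: same endpoint, same JSON fields, and the raw response body written to a file. The function names are illustrative, and the response format is an assumption carried over from the `--output output.png` usage; if the server returns JSON instead of raw image bytes, the body would need decoding before saving.

```python
import json
import urllib.request


def build_request(prompt: str, base_url: str = "http://localhost:30000") -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    payload = {
        "prompt": prompt,
        "height": 1264,
        "width": 848,
        "num_inference_steps": 50,
        "guidance_scale": 4.0,
        "use_pe": True,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, out_path: str = "output.png") -> None:
    """Send the request and write the raw response body to disk, like curl --output."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        body = response.read()
    with open(out_path, "wb") as f:
        f.write(body)


# Example (requires the server from the previous step to be running):
# generate("A cyclist riding through a sunlit covered street at dusk.")
```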
