ERNIE-Image-Turbo

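ERNIE-Image-Turbo is an open-source text-to-image generation model developed by the ERNIE-Image team at Baidu. It is a distilled version of ERNIE-Image, built on the same single-stream Diffusion Transformer (DiT) architecture, and is designed to generate images quickly, in only 8 inference steps, while preserving high fidelity. In practical generation scenarios, where rendering the requested content accurately matters as much as aesthetics, the model retains strong controllability. In particular, ERNIE-Image-Turbo performs well at complex instruction following, text rendering, and structured image generation, which makes it well suited to posters, comics, multi-panel layouts, and other content-creation tasks that demand both visual quality and efficiency. It also supports a wide range of visual styles, including realistic photography, design-oriented imagery, and stylized aesthetic output.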

ERNIE-Image Mosaic

Highlights:

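Fast and efficient: As a distilled model of ERNIE-Image, ERNIE-Image-Turbo generates high-quality images in only 8 inference steps, which suits latency-sensitive applications.
Text rendering: ERNIE-Image-Turbo handles dense, long-form, and layout-sensitive text well, making it a strong choice for posters, infographics, UI-like images, and other text-heavy visual content.
Instruction following: The model reliably follows complex prompts that involve multiple objects, detailed relationship descriptions, and knowledge-intensive content.
Structured generation: ERNIE-Image-Turbo is good at structured visual tasks such as posters, comics, storyboards, and multi-panel compositions, where layout and organization are critical.
Style coverage: Beyond clean, legible design-oriented output, the model also supports realistic photography and distinctive stylized aesthetics, including softer, more cinematic visual tones.
Practical deployment: Thanks to its compact model size, ERNIE-Image-Turbo can run on a consumer GPU with 24 GB of VRAM, lowering the barrier to research, downstream applications, and model adaptation.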

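Released Versions

ERNIE-Image: our SFT model. It typically runs with 50 inference steps and offers stronger general capability and instruction fidelity.

ERNIE-Image-Turbo: our Turbo model, optimized with DMD and RL. It needs only 8 inference steps, giving faster generation and higher aesthetic quality.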
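As a rough illustration of how the two releases differ at inference time, the sketch below records the step counts quoted above in a small lookup table and computes the ratio between them. The names `StepPreset`, `PRESETS`, `steps_for`, and `step_reduction` are hypothetical helpers for this example, not part of any ERNIE-Image or diffusers API; only the step counts (50 and 8) come from the release descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepPreset:
    """Illustrative per-variant setting; not an official configuration."""
    variant: str
    num_inference_steps: int


# Step counts taken from the release descriptions above.
PRESETS = {
    "ERNIE-Image": StepPreset("ERNIE-Image", 50),
    "ERNIE-Image-Turbo": StepPreset("ERNIE-Image-Turbo", 8),
}


def steps_for(variant: str) -> int:
    """Return the quoted step count for a variant, or raise for unknown names."""
    try:
        return PRESETS[variant].num_inference_steps
    except KeyError:
        raise ValueError(f"unknown variant: {variant!r}") from None


def step_reduction() -> float:
    """Ratio of base-model steps to Turbo steps (50 / 8)."""
    return steps_for("ERNIE-Image") / steps_for("ERNIE-Image-Turbo")


if __name__ == "__main__":
    for name in PRESETS:
        print(name, steps_for(name))
    print(f"Turbo uses {step_reduction():.2f}x fewer steps")
```

The ratio only counts denoising steps; actual wall-clock speedup also depends on per-step cost, which the descriptions above do not quantify.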

Benchmarks

GENEval

Model	Single Object	Two Objects	Counting	Colors	Position	Attribute Binding	Overall
ERNIE-Image (w/o PE)	1.0000	0.9596	0.7781	0.9282	0.8550	0.7925	0.8856
ERNIE-Image (w/ PE)	0.9906	0.9596	0.8187	0.8830	0.8625	0.7225	0.8728
Qwen-Image	0.9900	0.9200	0.8900	0.8800	0.7600	0.7700	0.8683
ERNIE-Image-Turbo (w/o PE)	1.0000	0.9621	0.7906	0.9202	0.7975	0.7300	0.8667
ERNIE-Image-Turbo (w/ PE)	0.9938	0.9419	0.8375	0.8351	0.7950	0.7025	0.8510
FLUX.2-klein-9B	0.9313	0.9571	0.8281	0.9149	0.7175	0.7400	0.8481
Z-Image	1.0000	0.9400	0.7800	0.9300	0.6200	0.7700	0.8400
Z-Image-Turbo	1.0000	0.9500	0.7700	0.8900	0.6500	0.6800	0.8233
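The Overall column in the GENEval table appears to be the unweighted mean of the six sub-scores. That is an observation from the numbers, not something the table states, so the sketch below treats it as an assumption and simply checks how closely the recomputed mean matches the reported value for a few rows. The helper names `overall` and `max_deviation` are made up for this example.

```python
from statistics import mean

# (sub-scores, reported overall), copied from the GENEval table above.
# Sub-score order: single object, two objects, counting, colors, position,
# attribute binding.
ROWS = {
    "ERNIE-Image (w/o PE)": ([1.0000, 0.9596, 0.7781, 0.9282, 0.8550, 0.7925], 0.8856),
    "ERNIE-Image-Turbo (w/o PE)": ([1.0000, 0.9621, 0.7906, 0.9202, 0.7975, 0.7300], 0.8667),
    "ERNIE-Image-Turbo (w/ PE)": ([0.9938, 0.9419, 0.8375, 0.8351, 0.7950, 0.7025], 0.8510),
    "Qwen-Image": ([0.9900, 0.9200, 0.8900, 0.8800, 0.7600, 0.7700], 0.8683),
}


def overall(sub_scores):
    """Unweighted mean of the sub-scores (assumed derivation of Overall)."""
    return mean(sub_scores)


def max_deviation(rows=ROWS):
    """Largest absolute gap between the recomputed mean and the reported Overall."""
    return max(abs(overall(subs) - reported) for subs, reported in rows.values())


if __name__ == "__main__":
    for name, (subs, reported) in ROWS.items():
        print(f"{name}: recomputed={overall(subs):.4f} reported={reported:.4f}")
    print(f"max deviation: {max_deviation():.6f}")
```

For these rows the recomputed mean agrees with the reported Overall to within rounding at four decimal places.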

OneIG-EN

Model	Alignment	Text	Reasoning	Style	Diversity	Overall
Nano Banana 2.0	0.8880	0.9440	0.3340	0.4810	0.2450	0.5780
Seedream 4.5	0.8910	0.9980	0.3500	0.4340	0.2070	0.5760
ERNIE-Image (w/ PE)	0.8678	0.9788	0.3566	0.4309	0.2411	0.5750
Seedream 4.0	0.8920	0.9830	0.3470	0.4530	0.1910	0.5730
ERNIE-Image-Turbo (w/ PE)	0.8676	0.9666	0.3537	0.4191	0.2212	0.5656
ERNIE-Image (w/o PE)	0.8909	0.9668	0.2950	0.4471	0.1687	0.5537
Z-Image	0.8810	0.9870	0.2800	0.3870	0.1940	0.5460
Qwen-Image	0.8820	0.8910	0.3060	0.4180	0.1970	0.5390
ERNIE-Image-Turbo (w/o PE)	0.8795	0.9488	0.2913	0.4277	0.1232	0.5341
FLUX.2-klein-9B	0.8871	0.8657	0.3117	0.4417	0.1560	0.5324
Qwen-Image-2512	0.8760	0.9900	0.2920	0.3380	0.1510	0.5300
GLM-Image	0.8050	0.9690	0.2980	0.3530	0.2130	0.5280
Z-Image-Turbo	0.8400	0.9940	0.2980	0.3680	0.1390	0.5280
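To make the effect of the prompt enhancer (PE) easier to read off the OneIG-EN table, the following sketch subtracts the ERNIE-Image-Turbo scores without PE from the scores with PE. The numbers are copied from the table; the function name `pe_deltas` and the lowercase metric labels are illustrative choices for this example only.

```python
METRICS = ["alignment", "text", "reasoning", "style", "diversity", "overall"]

# ERNIE-Image-Turbo rows copied from the OneIG-EN table above.
WITH_PE = [0.8676, 0.9666, 0.3537, 0.4191, 0.2212, 0.5656]
WITHOUT_PE = [0.8795, 0.9488, 0.2913, 0.4277, 0.1232, 0.5341]


def pe_deltas(with_pe=WITH_PE, without_pe=WITHOUT_PE):
    """Return {metric: score with PE minus score without PE}, rounded to 4 places."""
    return {
        name: round(a - b, 4)
        for name, a, b in zip(METRICS, with_pe, without_pe)
    }


if __name__ == "__main__":
    for name, delta in pe_deltas().items():
        print(f"{name:>10}: {delta:+.4f}")
```

On this benchmark, enabling PE raises the text, reasoning, diversity, and overall scores for the Turbo model, while alignment and style drop slightly.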

OneIG-ZH

Model	Alignment	Text	Reasoning	Style	Diversity	Overall
Nano Banana 2.0	0.8430	0.9830	0.3110	0.4610	0.2360	0.5670
ERNIE-Image (w/ PE)	0.8299	0.9539	0.3056	0.4342	0.2478	0.5543
Seedream 4.0	0.8360	0.9860	0.3040	0.4430	0.2000	0.5540
Seedream 4.5	0.8320	0.9860	0.3000	0.4260	0.2130	0.5510
Qwen-Image	0.8250	0.9630	0.2670	0.4050	0.2790	0.5480
ERNIE-Image-Turbo (w/ PE)	0.8258	0.9386	0.3043	0.4208	0.2281	0.5435
Z-Image	0.7930	0.9880	0.2660	0.3860	0.2430	0.5350
ERNIE-Image (w/o PE)	0.8421	0.8979	0.2656	0.4212	0.1772	0.5208
Qwen-Image-2512	0.8230	0.9830	0.2720	0.3420	0.1570	0.5150
GLM-Image	0.7380	0.9760	0.2840	0.3350	0.2210	0.5110
Z-Image-Turbo	0.7820	0.9820	0.2760	0.3610	0.1340	0.5070
ERNIE-Image-Turbo (w/o PE)	0.8326	0.9086	0.2580	0.4002	0.1316	0.5062
FLUX.2-klein-9B	0.8201	0.4920	0.2599	0.4166	0.1625	0.4302

LongTextBench

Model	LongText-Bench-EN	LongText-Bench-ZH	Average
Seedream 4.5	0.9890	0.9873	0.9882
ERNIE-Image (w/ PE)	0.9804	0.9661	0.9733
GLM-Image	0.9524	0.9788	0.9656
ERNIE-Image-Turbo (w/ PE)	0.9675	0.9636	0.9655
Nano Banana 2.0	0.9808	0.9491	0.9650
ERNIE-Image-Turbo (w/o PE)	0.9602	0.9675	0.9639
ERNIE-Image (w/o PE)	0.9679	0.9594	0.9636
Qwen-Image-2512	0.9561	0.9647	0.9604
Qwen-Image	0.9430	0.9460	0.9445
Z-Image	0.9350	0.9360	0.9355
Seedream 4.0	0.9214	0.9261	0.9238
Z-Image-Turbo	0.9170	0.9260	0.9215
FLUX.2-klein-9B	0.8642	0.2183	0.5413

Quick Start

Diffusers

Install the latest version of diffusers:

pip install git+https://github.com/huggingface/diffusers

import torch
from diffusers import ErnieImagePipeline

pipe = ErnieImagePipeline.from_pretrained(
    "Baidu/ERNIE-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=True,  # use the prompt enhancer (PE)
).images[0]

image.save("output.png")

SGLang

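Get the latest version of sglang by cloning the repository, then install it following the project's own instructions: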

git clone https://github.com/sgl-project/sglang.git

Start the server:

sglang serve --model-path baidu/ERNIE-Image-Turbo

Send a generation request:

curl -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 8,
    "guidance_scale": 1.0,
    "use_pe": true
  }' \
  --output output.png
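The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call: it posts the same JSON fields to the same endpoint and writes the raw response body to a file, as `curl --output` does. The function names `build_payload` and `generate` and the 600-second timeout are assumptions made for this example. The response format is not described here, so if the server returns a JSON envelope rather than image bytes, the saved file would need to be decoded accordingly.

```python
import json
import urllib.request

# Endpoint from the curl example above; adjust host and port to your server.
URL = "http://localhost:30000/v1/images/generations"


def build_payload(prompt, height=1264, width=848, steps=8,
                  guidance_scale=1.0, use_pe=True):
    """Assemble the JSON body with the same fields as the curl example."""
    return {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "use_pe": use_pe,
    }


def generate(prompt, out_path="output.png", url=URL, timeout=600):
    """POST the request and write the raw response body to out_path."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return out_path
```

With the server running, calling `generate("A cyclist riding through a sunlit covered street at dusk.")` sends the request and saves whatever the server returns to `output.png`.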
