Gemini Multimodal Media (Image/Video/Speech) Skill

1. Goals and scope

This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:

Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration)
Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports inline and Files API)
Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio)
Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence)
Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
Audio understanding (upload/inline; description, transcription, time-range transcription, token counting)

Convention: This Skill follows the official Google Gen AI SDK (Node.js/REST) as its baseline and currently provides only Node.js and REST examples. If your project wraps the API in another language or framework, map this Skill's request structure, model selection, and I/O spec onto your wrapper layer.

2. Quick routing (decide which capability to use)

Do you need to produce images?

Need to generate images from scratch or edit based on an image -> use Nano Banana image generation (see Section 5)

Do you need to understand images?

Need recognition, description, Q&A, comparison, or info extraction -> use Image understanding (see Section 6)

Do you need to produce video?

Need to generate an 8-second video (optionally with native audio) -> use Veo 3.1 video generation (see Section 7)

Do you need to understand video?

Need summaries/Q&A/segment extraction with timestamps -> use Video understanding (see Section 8)

Do you need to read text aloud?

Need controllable narration, podcast/audiobook style, etc. -> use Speech generation (TTS) (see Section 9)

Do you need to understand audio?

Need audio descriptions, transcription, time-range transcription, token counting -> use Audio understanding (see Section 10)

3. Unified engineering constraints and I/O spec (must read)

3.0 Prerequisites (dependencies and tools)

Node.js 18+ (match your project version)
Install SDK (example):

npm install @google/genai

REST examples only need curl; to extract Base64 image data from JSON responses, also install jq (optional).

3.1 Authentication and environment variables

Put your API key in GEMINI_API_KEY
REST requests use x-goog-api-key: $GEMINI_API_KEY

3.2 Two file input modes: Inline vs Files API

Inline (embedded bytes/Base64)

Pros: shorter call chain, good for small files.
Key constraint: total request size (text prompt + system instructions + embedded bytes) typically has a ~20MB ceiling.

Files API (upload then reference)

Pros: good for large files, reusing the same file, or multi-turn conversations.
Typical flow:
1. files.upload(...) (SDK) or POST /upload/v1beta/files (REST resumable)
2. Reference the returned URI in generateContent: file_data.file_uri in REST, or createPartFromUri(uri, mimeType) in the SDK

Engineering suggestion: implement ensure_file_uri() so that any file that exceeds a size threshold (for example, 10-15MB) or is reused across requests is routed through the Files API automatically.
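
The routing rule above can be sketched as a small size check. This is a minimal sketch, not SDK code: chooseInputMode, the reused flag, and the 10MB default threshold are hypothetical names and values, chosen so that Base64 inflation still leaves headroom under the ~20MB request ceiling. An ensure_file_uri() wrapper could call it before deciding whether to upload.

```javascript
// Hypothetical helper: decide whether a media file should be embedded inline
// or uploaded through the Files API. The default threshold is an assumption;
// Base64 encoding inflates raw bytes by about one third, so 10MB of raw data
// becomes roughly 13.3MB inside the request body.
const INLINE_THRESHOLD_BYTES = 10 * 1024 * 1024;

function chooseInputMode(
  sizeBytes,
  { reused = false, thresholdBytes = INLINE_THRESHOLD_BYTES } = {}
) {
  if (!Number.isFinite(sizeBytes) || sizeBytes < 0) {
    throw new RangeError("sizeBytes must be a non-negative number");
  }
  // Reused files go through the Files API regardless of size, so they are uploaded once.
  if (reused || sizeBytes > thresholdBytes) return "files_api";
  return "inline";
}
```

For example, chooseInputMode(fs.statSync(path).size, { reused: true }) would always select the Files API path.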

3.3 Unified handling of binary media outputs

Images: usually returned as Base64 inline data in response parts (part.inlineData.data in the Node.js SDK); decode the Base64 and save as PNG/JPG.
Speech (TTS): usually returns PCM bytes (Base64); save as .pcm or wrap into .wav (commonly 24kHz, 16-bit, mono).
Video (Veo): long-running async task; poll the operation; download the file (or use the returned URI).
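
As a sketch of unified handling for the inline cases (images and TTS audio), the helper below collects every inline media part from a response and decodes it. extractInlineMedia is a hypothetical name, and the fallback MIME type is an assumption; the field names follow the SDK response shape used in the templates in this Skill.

```javascript
// Hypothetical helper: collect all inline media parts from a generateContent
// response (SDK shape, camelCase fields) and decode them to Buffers.
// Text-only parts are skipped.
function extractInlineMedia(response) {
  const parts = response?.candidates?.[0]?.content?.parts ?? [];
  return parts
    .filter((part) => part.inlineData?.data)
    .map((part) => ({
      // Assumed fallback when the part carries no MIME type.
      mimeType: part.inlineData.mimeType ?? "application/octet-stream",
      bytes: Buffer.from(part.inlineData.data, "base64"),
    }));
}
```

Each returned entry can be written with fs.writeFileSync, choosing the file extension from mimeType.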

4. Model selection matrix (choose by scenario)

Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.

4.1 Image generation (Nano Banana)

gemini-2.5-flash-image: optimized for speed/throughput; good for frequent, low-latency generation/editing.
gemini-3-pro-image-preview: stronger instruction following and high-fidelity text rendering; better for professional assets and complex edits.

4.2 General image/video/audio understanding

Docs use gemini-3-flash-preview for image, video, and audio understanding (choose stronger models as needed for quality/cost).

4.3 Video generation (Veo)

Example model: veo-3.1-generate-preview (generates 8-second video and can natively generate audio).

4.4 Speech generation (TTS)

Example model: gemini-2.5-flash-preview-tts (native TTS, currently in preview).

5. Image generation (Nano Banana)

5.1 Text-to-Image

SDK (Node.js) minimal template

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents:
    "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
});

const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.text) console.log(part.text);
  if (part.inlineData?.data) {
    fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
  }
}

REST (with imageConfig) minimal template

curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents":[{"parts":[{"text":"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
    "generationConfig": {"imageConfig": {"aspectRatio":"16:9"}}
  }'

REST image parsing (Base64 decode)

curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"contents":[{"parts":[{"text":"A minimal studio product shot of a nano banana"}]}]}' \
  | jq -r '.candidates[0].content.parts[] | (.inlineData // .inline_data) | select(. != null) | .data' \
  | base64 --decode > out.png

# macOS can use: base64 -D > out.png

5.2 Text-and-Image-to-Image

Use case: given an image, add/remove/modify elements, change style, color grading, etc.

SDK (Node.js) minimal template

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "Add a nano banana on the table, keep lighting consistent, cinematic tone.";
const imageBase64 = fs.readFileSync("input.png").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-image",
  contents: [
    { text: prompt },
    { inlineData: { mimeType: "image/png", data: imageBase64 } },
  ],
});

const parts = response.candidates?.[0]?.content?.parts ?? [];
for (const part of parts) {
  if (part.inlineData?.data) {
    fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
  }
}

5.3 Multi-turn image iteration

Best practice: use a chat session for continuous iteration (for example: generate first, then "only edit a specific region/element", then "make variants in the same style").
To get mixed text and image output, set responseModalities to ["TEXT", "IMAGE"] in the SDK config (generationConfig.responseModalities in REST).
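
A minimal sketch of that chat-based loop follows. iterateImage is a hypothetical helper; ai is assumed to be a GoogleGenAI client from @google/genai, passed in as an argument so the function itself needs no imports.

```javascript
// Sketch of multi-turn image editing through one chat session.
// `ai` is assumed to be a GoogleGenAI client; `prompts` is a list of edit
// instructions applied in order.
async function iterateImage(ai, prompts, model = "gemini-2.5-flash-image") {
  const chat = ai.chats.create({
    model,
    config: { responseModalities: ["TEXT", "IMAGE"] },
  });
  const images = [];
  for (const message of prompts) {
    // Each turn sees the earlier turns, so "change only the background"
    // refers to the image produced in the previous turn.
    const response = await chat.sendMessage({ message });
    const parts = response.candidates?.[0]?.content?.parts ?? [];
    const imagePart = parts.find((part) => part.inlineData?.data);
    images.push(
      imagePart ? Buffer.from(imagePart.inlineData.data, "base64") : null
    );
  }
  return images; // one entry per turn; null when a turn returned no image
}
```

Keeping every intermediate image makes it easy to roll back when a later edit degrades the result.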

5.4 ImageConfig

Set these in generationConfig.imageConfig (REST) or config.imageConfig (SDK):

aspectRatio: e.g. 16:9, 1:1.
imageSize: e.g. 2K, 4K (higher resolution is usually slower/more expensive and model support can vary).
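
For the SDK side, the sketch below assembles generateContent arguments with an optional imageConfig. buildImageRequest is a hypothetical helper, and it deliberately does not validate values, because supported aspect ratios and sizes depend on the model.

```javascript
// Hypothetical helper: build generateContent arguments with an optional
// imageConfig. Values are passed through unvalidated because support varies
// by model; check the current docs for the allowed values.
function buildImageRequest(
  prompt,
  { model = "gemini-2.5-flash-image", aspectRatio, imageSize } = {}
) {
  const imageConfig = {};
  if (aspectRatio) imageConfig.aspectRatio = aspectRatio; // e.g. "16:9"
  if (imageSize) imageConfig.imageSize = imageSize; // e.g. "2K"
  const request = { model, contents: prompt };
  // Omit config entirely when no image option was requested.
  if (Object.keys(imageConfig).length > 0) request.config = { imageConfig };
  return request;
}
```

Usage: await ai.models.generateContent(buildImageRequest("A studio product shot", { aspectRatio: "16:9" })).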

6. Image understanding

6.1 Two ways to provide input images

Inline image data: suitable for small files (total request size < 20MB).
Files API upload: better for large files or reuse across multiple requests.

6.2 Inline images (Node.js) minimal template

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const imageBase64 = fs.readFileSync("image.jpg").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: [
    { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
    { text: "Caption this image, and list any visible brands." },
  ],
});

console.log(response.text);

6.3 Upload and reference with Files API (Node.js) minimal template

import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "image.jpg" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Caption this image.",
  ]),
});

console.log(response.text);

6.4 Multi-image prompts

Append multiple images as multiple Part entries in the same contents; you can mix uploaded references and inline bytes.
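
A sketch of such a mixed request body is shown below. buildMultiImageContents and its input shape ({ uri, mimeType } for uploaded files, { base64, mimeType } for inline bytes) are hypothetical; the fileData and inlineData part fields follow the SDK's camelCase part structure.

```javascript
// Hypothetical helper: build one user turn that mixes Files API references
// and inline bytes, followed by the question.
// Each image is either { uri, mimeType } or { base64, mimeType }.
function buildMultiImageContents(images, question) {
  const parts = images.map((image) =>
    image.uri
      ? { fileData: { fileUri: image.uri, mimeType: image.mimeType } }
      : { inlineData: { data: image.base64, mimeType: image.mimeType } }
  );
  parts.push({ text: question });
  return [{ role: "user", parts }];
}
```

The result can be passed as contents to ai.models.generateContent, for example with the question "What differs between these two images?".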

7. Video generation (Veo 3.1)

7.1 Core features (must know)

Generates 8-second high-fidelity video, optionally 720p / 1080p / 4k, and supports native audio generation (dialogue, ambience, SFX).
Supports:
- Aspect ratio (16:9 / 9:16)
- Video extension (extend a generated video; typically limited to 720p)
- First/last frame control (frame-specific)
- Up to 3 reference images (image-based direction)

7.2 SDK (Node.js) minimal template: async polling + download

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const prompt =
  "A cinematic shot of a cat astronaut walking on the moon. Include subtle wind ambience.";
let operation = await ai.models.generateVideos({
  model: "veo-3.1-generate-preview",
  prompt,
  config: { resolution: "1080p" },
});

while (!operation.done) {
  await new Promise((resolve) => setTimeout(resolve, 10_000));
  operation = await ai.operations.getVideosOperation({ operation });
}

const video = operation.response?.generatedVideos?.[0]?.video;
if (!video) throw new Error("No video returned");
await ai.files.download({ file: video, downloadPath: "out.mp4" });

7.3 REST minimal template: predictLongRunning + poll + download

Key point: the Veo REST endpoint :predictLongRunning returns an operation name; poll GET /v1beta/{operation_name} until the operation is done, then download from the video URI in the response.
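
The flow above can be sketched with the fetch built into Node.js 18+. Treat this as a sketch under stated assumptions: generateVideoRest is a hypothetical helper, the request body shape ({ instances: [{ prompt }] }) and the response path to the video URI are assumptions based on the documented REST example, and both should be verified against the current docs.

```javascript
const BASE_URL = "https://generativelanguage.googleapis.com/v1beta";

// Start a Veo job over REST, poll until it is done, and return the video URI.
// `fetchImpl` defaults to the global fetch; it is injectable for testing.
async function generateVideoRest(
  prompt,
  apiKey,
  { fetchImpl = fetch, pollMs = 10_000, timeoutMs = 600_000 } = {}
) {
  const headers = { "x-goog-api-key": apiKey, "Content-Type": "application/json" };
  const start = await fetchImpl(
    `${BASE_URL}/models/veo-3.1-generate-preview:predictLongRunning`,
    { method: "POST", headers, body: JSON.stringify({ instances: [{ prompt }] }) }
  );
  if (!start.ok) throw new Error(`start failed: HTTP ${start.status}`);
  const { name } = await start.json(); // operation name

  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const poll = await fetchImpl(`${BASE_URL}/${name}`, { headers });
    if (!poll.ok) throw new Error(`poll failed: HTTP ${poll.status}`);
    const status = await poll.json();
    if (status.done) {
      if (status.error) throw new Error(`generation failed: ${JSON.stringify(status.error)}`);
      // Assumed response path; verify against the current REST reference.
      const uri =
        status.response?.generateVideoResponse?.generatedSamples?.[0]?.video?.uri;
      if (!uri) throw new Error("operation finished without a video URI");
      return uri;
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
  throw new Error("video generation timed out");
}
```

To download, request the returned URI with the same x-goog-api-key header and write the response body to disk; fetch follows redirects by default.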

7.4 Common controls (recommend a unified wrapper)

aspectRatio: "16:9" or "9:16"
resolution: "720p" | "1080p" | "4k" (higher resolutions are usually slower/more expensive)
When writing prompts: put dialogue in quotes; explicitly call out SFX and ambience; use cinematography language (camera position, movement, composition, lens effects, mood).
Negative constraints: if the API supports a negative prompt field, use it; otherwise list elements you do not want to see.
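
One possible shape for the unified wrapper is a config builder that rejects unsupported values before a request is sent. buildVeoConfig and its defaults are hypothetical, the allowed values are the ones listed above, and negativePrompt is an assumed field name that should be checked against your SDK version.

```javascript
// Allowed values as listed in this section; update if the docs change.
const VEO_ASPECT_RATIOS = ["16:9", "9:16"];
const VEO_RESOLUTIONS = ["720p", "1080p", "4k"];

// Hypothetical helper: validate common Veo controls and return a config
// object suitable for ai.models.generateVideos({ model, prompt, config }).
function buildVeoConfig({
  aspectRatio = "16:9",
  resolution = "720p",
  negativePrompt,
} = {}) {
  if (!VEO_ASPECT_RATIOS.includes(aspectRatio)) {
    throw new RangeError(`unsupported aspectRatio: ${aspectRatio}`);
  }
  if (!VEO_RESOLUTIONS.includes(resolution)) {
    throw new RangeError(`unsupported resolution: ${resolution}`);
  }
  const config = { aspectRatio, resolution };
  // Assumed field name; omit it when no negative constraints are given.
  if (negativePrompt) config.negativePrompt = negativePrompt;
  return config;
}
```

Failing early on a bad value avoids spending a long-running generation on a request that would be rejected anyway.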

7.5 Important limits (engineering fallback needed)

Latency can vary from seconds to minutes; implement timeouts and retries.
Generated videos are only retained on the server for a limited time (download promptly).
Outputs include a SynthID watermark.

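Polling fallback (with timeout/backoff)

// Polls a Veo operation with exponential backoff until it finishes or the deadline passes.
// `ai` is the GoogleGenAI client and `operation` is the value returned by generateVideos (see 7.2).
async function waitForVideo(ai, operation, timeoutMs = 300_000) {
  const deadline = Date.now() + timeoutMs; // default: 5 min
  let sleepMs = 2000;
  while (!operation.done && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, sleepMs));
    sleepMs = Math.min(Math.floor(sleepMs * 1.5), 15_000); // back off, capped at 15 s
    operation = await ai.operations.getVideosOperation({ operation });
  }
  if (!operation.done) throw new Error("video generation timed out");
  return operation;
}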

8. Video understanding

8.1 Video input options

Files API upload: recommended when file > 100MB, video length > ~1 minute, or you need reuse.
Inline video data: for smaller files.
Direct YouTube URL: can analyze public videos.

8.2 Files API (Node.js) minimal template

import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
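let uploaded = await ai.files.upload({ file: "sample.mp4" });

// Uploaded videos may be processed asynchronously; wait until the file leaves PROCESSING.
while (uploaded.state === "PROCESSING") {
  await new Promise((resolve) => setTimeout(resolve, 5_000));
  uploaded = await ai.files.get({ name: uploaded.name });
}
if (uploaded.state === "FAILED") throw new Error("Video processing failed");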

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    createPartFromUri(uploaded.uri, uploaded.mimeType),
    "Summarize this video. Provide timestamps for key events.",
  ]),
});

console.log(response.text);

8.3 Timestamp prompting strategy

Ask for segmented bullets with "(mm:ss)" timestamps.
Require "evidence with specific time ranges" and include downstream structured extraction (JSON) in the same prompt if needed.
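
When the model is asked for "(mm:ss)" markers, the reply can be post-processed into seconds for seeking or clipping. The sketch below is a hypothetical helper (parseTimestamps) that also accepts an optional hours field; it assumes the model followed the requested format.

```javascript
// Hypothetical helper: extract "(mm:ss)" or "(h:mm:ss)" markers from model
// output and convert each one to an offset in seconds.
function parseTimestamps(text) {
  const pattern = /\((?:(\d+):)?(\d{1,2}):(\d{2})\)/g;
  const results = [];
  for (const match of text.matchAll(pattern)) {
    // The hours group is undefined for plain "(mm:ss)" markers.
    const [label, hours = "0", minutes, seconds] = match;
    results.push({
      label,
      seconds: Number(hours) * 3600 + Number(minutes) * 60 + Number(seconds),
    });
  }
  return results;
}
```

Markers that do not match the pattern are silently skipped, so a count of zero is a useful signal that the prompt needs a stricter format instruction.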

9. Speech generation (Text-to-Speech, TTS)

9.1 Positioning

Native TTS: for "precise reading + controllable style" (podcasts, audiobooks, ad voiceover, etc.).
Distinguish from the Live API: the Live API targets interactive, unstructured audio and multimodal conversation, whereas TTS targets exact recitation of a given text with controlled style.

9.2 Single-speaker TTS (Node.js) minimal template

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-tts",
  contents: [{ parts: [{ text: "Say cheerfully: Have a wonderful day!" }] }],
  config: {
    responseModalities: ["AUDIO"],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: "Kore" },
      },
    },
  },
});

const data =
  response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data ?? "";
if (!data) throw new Error("No audio returned");
fs.writeFileSync("out.pcm", Buffer.from(data, "base64"));
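
Raw .pcm files are not playable in most players. The sketch below wraps the PCM bytes in a standard 44-byte RIFF/WAVE header; pcmToWav is a hypothetical helper, and its defaults assume the 24kHz, 16-bit, mono format mentioned in Section 3.3, which should be confirmed for the model in use.

```javascript
// Hypothetical helper: wrap raw little-endian PCM samples in a 44-byte
// RIFF/WAVE header. Defaults assume 24kHz, 16-bit, mono output.
function pcmToWav(
  pcm,
  { sampleRate = 24000, channels = 1, bitsPerSample = 16 } = {}
) {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0, "ascii");
  header.writeUInt32LE(36 + pcm.length, 4); // file size minus the first 8 bytes
  header.write("WAVE", 8, "ascii");
  header.write("fmt ", 12, "ascii");
  header.writeUInt32LE(16, 16); // fmt chunk size
  header.writeUInt16LE(1, 20); // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36, "ascii");
  header.writeUInt32LE(pcm.length, 40); // data chunk size
  return Buffer.concat([header, pcm]);
}
```

Usage: fs.writeFileSync("out.wav", pcmToWav(Buffer.from(data, "base64"))).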

9.3 Multi-speaker TTS (max 2 speakers)

Requirements:

Use multiSpeakerVoiceConfig
Each speaker name must match the dialogue labels in the prompt (e.g., Joe/Jane).
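
A sketch of the two requirements follows. buildMultiSpeakerConfig and missingSpeakerLabels are hypothetical helpers; the nested multiSpeakerVoiceConfig.speakerVoiceConfigs structure mirrors the single-speaker config above and should be verified against the current SDK reference. The label check assumes dialogue lines of the form "Name: text".

```javascript
// Hypothetical helper: build a TTS config for up to two named speakers.
// `speakers` is a list of { speaker, voiceName } pairs.
function buildMultiSpeakerConfig(speakers) {
  if (speakers.length < 1 || speakers.length > 2) {
    throw new RangeError("multi-speaker TTS accepts 1 or 2 speakers");
  }
  return {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: speakers.map(({ speaker, voiceName }) => ({
          speaker,
          voiceConfig: { prebuiltVoiceConfig: { voiceName } },
        })),
      },
    },
  };
}

// Hypothetical check: return the speakers that never appear as a "Name:"
// label in the prompt, so mismatches are caught before the request is sent.
function missingSpeakerLabels(prompt, speakers) {
  return speakers
    .map(({ speaker }) => speaker)
    .filter((name) => !prompt.includes(`${name}:`));
}
```

Pass the returned object as config to ai.models.generateContent with the TTS model, and send the request only when missingSpeakerLabels returns an empty list.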

9.4 Voice options and language

voiceName (voice_name in REST) supports 30 prebuilt voices (for example Zephyr, Puck, Charon, Kore, etc.).
The model can auto-detect input language and supports 24 languages (see docs for the list).

9.5 "Director notes" (strongly recommended for high-quality voice)

Provide controllable directions for style, pace, accent, etc., but avoid over-constraining.

10. Audio understanding

10.1 Typical tasks

Describe audio content (including non-speech like birds, alarms, etc.)
Generate transcripts
Transcribe specific time ranges
Count tokens (for cost estimates/segmentation)

10.2 Files API (Node.js) minimal template

import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const uploaded = await ai.files.upload({ file: "sample.mp3" });

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents: createUserContent([
    "Describe this audio clip.",
    createPartFromUri(uploaded.uri, uploaded.mimeType),
  ]),
});

console.log(response.text);

10.3 Key limits and engineering tips

Supports common formats: WAV/MP3/AIFF/AAC/OGG/FLAC.
Audio tokenization: about 32 tokens/second (about 1920 tokens per minute; values may change).
Total audio length per prompt is capped at 9.5 hours; multi-channel audio is downmixed; audio is resampled (see docs for exact parameters).
If total request size exceeds 20MB, you must use the Files API.
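
The figures above translate into a quick estimate and a segmentation plan. This is a sketch using the numbers quoted in this section (32 tokens/second, 9.5-hour cap), which may change; estimateAudioTokens and splitIntoSegments are hypothetical helpers, and the estimate does not replace an actual countTokens call.

```javascript
// Figures quoted in this section; verify against the current docs.
const TOKENS_PER_SECOND = 32;
const MAX_AUDIO_SECONDS = 9.5 * 3600; // 34,200 seconds per prompt

// Rough token estimate for an audio clip of the given duration.
function estimateAudioTokens(durationSeconds) {
  return Math.ceil(durationSeconds * TOKENS_PER_SECOND);
}

// Split a long recording into consecutive [start, end) ranges in seconds,
// each no longer than maxSegmentSeconds.
function splitIntoSegments(durationSeconds, maxSegmentSeconds = MAX_AUDIO_SECONDS) {
  if (maxSegmentSeconds <= 0) {
    throw new RangeError("maxSegmentSeconds must be positive");
  }
  const segments = [];
  for (let start = 0; start < durationSeconds; start += maxSegmentSeconds) {
    segments.push({
      start,
      end: Math.min(start + maxSegmentSeconds, durationSeconds),
    });
  }
  return segments;
}
```

For example, a 10-minute clip comes to roughly 600 × 32 = 19,200 tokens, and each segment's start and end can be turned into a time-range transcription prompt.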

11. End-to-end examples (composition)

Example A: Image generation -> validation via understanding

Generate product images with Nano Banana (require negative space, consistent lighting).
Use image understanding for self-check: verify text clarity, brand spelling, and unsafe elements.
If not satisfied, feed the generated image into text+image editing and iterate.

Example B: Video generation -> video understanding -> narration script

Generate an 8-second shot with Veo (include dialogue or SFX).
Download and save (respect retention window).
Upload video to video understanding to produce a storyboard + timestamps + narration copy (then feed to TTS).

Example C: Audio understanding -> time-range transcription -> TTS redub

Upload meeting audio and transcribe full content.
Transcribe or summarize specific time ranges.
Use TTS to generate a "broadcast" version of the summary.

12. Compliance and risk (must follow)

Ensure you have the necessary rights to upload images/video/audio; do not generate infringing, deceptive, harassing, or harmful content.
Generated images and videos include SynthID watermarking; videos may also have regional/person-based generation constraints.
Production systems must implement timeouts, retries, failure fallbacks, and human review/post-processing for generated content.

13. Quick reference (Checklist)

[ ] Pick the right model: image generation (Flash Image / Pro Image Preview), video generation (Veo 3.1), TTS (Gemini 2.5 TTS), understanding (Gemini Flash/Pro).
[ ] Pick the right input mode: inline for small files; Files API for large/reuse.
[ ] Parse binary outputs correctly: image/audio via Base64 decode of inlineData; video via operation polling + download.
[ ] For video generation: set aspectRatio / resolution, and download promptly (avoid expiration).
[ ] For TTS: set responseModalities to ["AUDIO"]; max 2 speakers; speaker names must match the dialogue labels in the prompt.
[ ] For audio understanding: countTokens when needed; segment long audio or use Files API.
