
# Media

Transcode and inspect audio/video from a single API call — no media-processing account, no per-minute plan. Your agent posts a file (inline or by storage reference) and gets back a re-encoded clip, an extracted or transcoded audio track, a loudness-normalized mix, a sped-up cut, a concatenated reel, a watermarked or captioned video, an animated GIF, a single frame, an extracted subtitle file, or a metadata report. Backed by `ffmpeg`/`ffprobe` on a dedicated arm64 media worker; priced per call, not per month.

```bash
curl -X POST https://api.relaystation.ai/v1/media/thumbnail \
  -H 'Authorization: Bearer rs_live_<key>' \
  -H 'Idempotency-Key: thumb-promo-20260611' \
  -H 'Content-Type: application/json' \
  -d '{ "file": { "inline": "<base64 video>" }, "timestamp": 3, "format": "png" }'
```

The same call works on the lodestone path — no account, just a signed x402 payment instead of an API key:

```bash
curl -X POST https://api.relaystation.ai/v1/media/thumbnail \
  -H 'X-Payment: <base64 EIP-3009 authorization>' \
  -H 'Idempotency-Key: thumb-promo-20260611' \
  -H 'Content-Type: application/json' \
  -d '{ "file": { "inline": "<base64 video>" }, "timestamp": 3, "format": "png" }'
```

## The operations

13 ops, two billing shapes. The transform ops bill per MB of input (min 1 MiB); the read/grab ops are flat per call. Three ops — `concat`, `overlay`, `subtitle-burn` — take a **second input** alongside the primary file (see [the second input](#the-second-input) below).

| Route | What it does | Price |
|---|---|---|
| `POST /v1/media/probe` | Inspect metadata via ffprobe — duration, container, streams, codecs, dimensions, bitrate | $0.0002 flat |
| `POST /v1/media/subtitle-extract` | Pull an embedded subtitle track → SRT/WebVTT text sidecar | $0.0003 flat |
| `POST /v1/media/thumbnail` | Grab a single video frame at a timestamp → PNG/JPEG | $0.0003 flat |
| `POST /v1/media/trim` | Cut a time range by fast stream-copy (codecs preserved) | $0.0005 / MB |
| `POST /v1/media/audio-extract` | Strip + (re)encode the audio track (mp3/m4a/aac/wav/ogg/flac) | $0.0005 / MB |
| `POST /v1/media/audio-convert` | Standalone audio→audio re-encode (format/bitrate/sample-rate) | $0.0005 / MB |
| `POST /v1/media/loudnorm` | Normalize audio loudness to EBU R128 (LUFS/true-peak/range) | $0.0005 / MB |
| `POST /v1/media/speed` | Change playback speed (pitch-preserving by default) | $0.0005 / MB |
| `POST /v1/media/convert` | Re-encode to another container/codec (H.264/AAC video, or audio); scale/crop/rotate | $0.0005 / MB |
| `POST /v1/media/concat` | Join 2–10 clips end-to-end (concat demuxer, stream-copy) | $0.0005 / MB of the sum |
| `POST /v1/media/overlay` | Watermark an image onto a video | $0.0005 / MB of the video |
| `POST /v1/media/subtitle-burn` | Burn (rasterize) subtitles into the video via libass | $0.0005 / MB of the video |
| `POST /v1/media/gif` | Turn a video segment into a palette-optimized animated GIF | $0.0005 / MB |
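
To make the two billing shapes concrete, here is a rough client-side estimator. It is a sketch, not the billing code: it assumes "MB" in the table means MiB (1,048,576 bytes) and that the per-MB price is prorated linearly above the 1 MiB minimum. The names `estimate_cost`, `FLAT_PRICES`, and `PER_MB_OPS` are made up for illustration.

```python
# Prices copied from the table above, in USD.
FLAT_PRICES = {
    "probe": 0.0002,
    "subtitle-extract": 0.0003,
    "thumbnail": 0.0003,
}
PER_MB_PRICE = 0.0005
PER_MB_OPS = {
    "trim", "audio-extract", "audio-convert", "loudnorm", "speed",
    "convert", "concat", "overlay", "subtitle-burn", "gif",
}
MIB = 1024 * 1024  # assumption: "MB" in the price table means MiB


def estimate_cost(op: str, billed_bytes: int) -> float:
    """Estimate the charge for one call.

    billed_bytes is the primary file's size, the summed size for concat,
    or the video's size for overlay / subtitle-burn.
    """
    if op in FLAT_PRICES:
        return FLAT_PRICES[op]
    if op not in PER_MB_OPS:
        raise ValueError(f"unknown op: {op}")
    # Assumption: linear proration with a floor of 1 MiB.
    billed_mb = max(billed_bytes, MIB) / MIB
    return PER_MB_PRICE * billed_mb
```

Under these assumptions a 10 MiB `convert` comes to about $0.005, and any transform on a file under 1 MiB costs the $0.0005 minimum.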

## Inputs, caps, and the billing grain

Files ride the shared input convention: `{ "inline": "<base64>" }` up to 4 MB, or `{ "inputKey": "..." }` minted from `POST /v1/cputools/upload-url` up to 50 MB. The [persistence tiers](/docs/persistence-tiers) page covers when to use which. Outputs come back the same way (inline when small, presigned URL when large); recipes for downloading or chaining them are in [Receiving outputs](/docs/receiving-outputs).
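
A client can pick the input shape from the file size. The helper below is a minimal sketch: `make_input_source` is a hypothetical name, the caps are read as decimal megabytes of raw (pre-base64) bytes, which is the stricter interpretation and an assumption, and minting the `inputKey` is left to the caller.

```python
import base64
from typing import Optional

INLINE_LIMIT = 4 * 1000 * 1000    # assumption: 4 MB counts raw bytes, decimal MB
UPLOAD_LIMIT = 50 * 1000 * 1000   # assumption: same reading for the 50 MB cap


def make_input_source(data: bytes, input_key: Optional[str] = None) -> dict:
    """Build the input-source object for a file of the given content."""
    size = len(data)
    if size > UPLOAD_LIMIT:
        raise ValueError("over the 50 MB cap; expect 413 INPUT_TOO_LARGE")
    if size <= INLINE_LIMIT:
        return {"inline": base64.b64encode(data).decode("ascii")}
    if input_key is None:
        raise ValueError(
            "too large to inline; mint an inputKey via "
            "POST /v1/cputools/upload-url first"
        )
    return {"inputKey": input_key}
```

The returned dict drops straight into the `file` field of any request body on this page.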

The synchronous window sets the caps, all operator-tunable:

- Inputs ≤ 50 MB (`cputools.media.max_mb` → `413 INPUT_TOO_LARGE`).
- Input duration ≤ 300 s for the audio and speed ops (`cputools.media.max_seconds` → `422 DURATION_TOO_LONG`).
- `convert` is tighter: ≤ 60 s, downscaled to 720p, because the re-encode is the expensive op.
- `gif` is ≤ 15 s and ≤ 640 px wide, because GIFs grow large quickly.

The ffmpeg run itself is killed at 25 s (`cputools.media.timeout_ms`) inside the ~29 s sync window. Billing is charge-on-attempt with a safety net: when a transform can't be delivered (a timeout kill, an unconvertible input), the call throws and the wrapper **reverses the charge**, so you're never billed for work that didn't ship. Every cap check above runs **before** the charge, so a rejected request is free.
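
Because rejected requests are free, a client-side preflight is optional, but it can save a round trip. The sketch below mirrors the default caps. `expected_rejection` is a hypothetical helper, the 300 s entries for `trim`, `overlay`, and `subtitle-burn` come from their own sections of this page, and the error code for `convert`'s 60 s cap is assumed to match the general duration cap.

```python
from typing import Optional

MAX_INPUT_BYTES = 50 * 1000 * 1000  # assumption: decimal MB, the stricter reading

# Default input-duration caps in seconds (operator-tunable on the server).
DURATION_CAPS = {
    "trim": 300, "audio-extract": 300, "audio-convert": 300,
    "loudnorm": 300, "speed": 300, "overlay": 300,
    "subtitle-burn": 300, "convert": 60,
}


def expected_rejection(op: str, size_bytes: int,
                       duration_s: Optional[float] = None) -> Optional[str]:
    """Return the error code a request would likely hit, or None if it passes."""
    if size_bytes > MAX_INPUT_BYTES:
        return "INPUT_TOO_LARGE"       # documented as 413
    cap = DURATION_CAPS.get(op)
    if cap is not None and duration_s is not None and duration_s > cap:
        # Assumption: convert's 60 s cap reports the same code as the 300 s cap.
        return "DURATION_TOO_LONG"     # documented as 422
    return None
```

The duration would typically come from a prior `probe` call; pass `None` when it is unknown and let the server decide.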

### The second input

`concat`, `overlay`, and `subtitle-burn` take a second input next to the primary `file`. That second input is **itself an input-source object** — exactly the shape the primary uses:

- `concat` — a `files` array (2–10), each item `{ "inline": "<base64>" }` (≤ 4 MB) or `{ "inputKey": "..." }` (≤ 50 MB). The whole array is the input; you bill on the summed size.
- `overlay` — an `overlay` field carrying the watermark image (`{ inline }` or `{ inputKey }`).
- `subtitle-burn` — a `subtitle` field carrying the SRT/WebVTT file (`{ inline }` or `{ inputKey }`).

Each secondary input is resolved and ownership-checked the same way as the primary; mint large secondaries through `POST /v1/cputools/upload-url` first — see [persistence tiers](/docs/persistence-tiers). Outputs always come back as a single file in the uniform envelope ([Receiving outputs](/docs/receiving-outputs)).
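
The three request shapes can be sketched as plain dict builders. The function names are hypothetical, and the `concat` body is assumed to carry only the `files` array (no separate `file` field), following the "whole array is the input" wording above.

```python
from typing import Optional


def concat_body(clips: list, ext: Optional[str] = None) -> dict:
    """clips is a list of input-source objects, e.g. {"inputKey": "..."}."""
    if not 2 <= len(clips) <= 10:
        raise ValueError("concat takes 2 to 10 clips")
    body = {"files": list(clips)}  # assumption: no separate "file" field
    if ext is not None:
        body["ext"] = ext
    return body


def overlay_body(video: dict, image: dict, position: str = "top-left") -> dict:
    """video and image are both input-source objects."""
    return {"file": video, "overlay": image, "position": position}


def subtitle_burn_body(video: dict, subtitle: dict) -> dict:
    """subtitle is an input-source object holding the SRT/WebVTT file."""
    return {"file": video, "subtitle": subtitle}
```

Each builder returns a JSON-serializable dict; pass it through `json.dumps` and post it to the matching route.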

## Input formats — sniffed, not trusted

The input format is detected by ffprobe (the container's `format_name`), never by file extension. The allowlist is the operator-tunable `cputools.media.input_formats`; an input whose sniffed format isn't on it returns a **free** `422 UNSUPPORTED_FORMAT` before any charge, and a file that isn't probeable media at all returns a free `422 MEDIA_PARSE_FAILED`. The default allowlist:

`mp4` · `mov` · `m4a` · `m4v` · `3gp` · `3g2` · `mj2` · `matroska` (mkv) · `webm` · `avi` · `flv` · `mpegts` · `mpeg` · `mpegvideo` · `asf` · `wmv` · `ogg` · `mp3` · `wav` · `flac` · `aac` · `gif`

(Matching is by ffprobe format-name token — e.g. a `.mkv` file sniffs as `matroska,webm`, and the QuickTime family sniffs as the combined `mov,mp4,m4a,3gp,3g2,mj2`.)
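
The token matching can be pictured as follows. This is a sketch of the idea, not the server's code: `is_allowed` is a hypothetical name, and it assumes an input passes when any one token of the sniffed `format_name` is on the allowlist.

```python
# Default allowlist copied from the list above.
DEFAULT_ALLOWLIST = frozenset({
    "mp4", "mov", "m4a", "m4v", "3gp", "3g2", "mj2", "matroska", "webm",
    "avi", "flv", "mpegts", "mpeg", "mpegvideo", "asf", "wmv", "ogg",
    "mp3", "wav", "flac", "aac", "gif",
})


def is_allowed(format_name: str, allowlist=DEFAULT_ALLOWLIST) -> bool:
    """Check an ffprobe format_name such as "matroska,webm" token by token."""
    tokens = [t.strip() for t in format_name.split(",") if t.strip()]
    # Assumption: a single matching token is enough to accept the input.
    return any(token in allowlist for token in tokens)
```

So `is_allowed("matroska,webm")` passes, while a still-image format name such as `png_pipe` does not.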

Each op also has an operator kill-switch (`cputools.media.<op>.enabled`); a disabled op returns a free `422 OP_DISABLED`.

## probe — the full response

`{ "file": { ... } }` is the whole request. The response carries everything ffprobe reports, under a `probe` key:

```json
{
  "probe": {
    "formatName": "mov,mp4,m4a,3gp,3g2,mj2",
    "durationSeconds": 12.43,
    "sizeBytes": 1048576,
    "bitRate": 674812,
    "hasVideo": true,
    "hasAudio": true,
    "width": 1280,
    "height": 720,
    "videoCodec": "h264",
    "audioCodec": "aac",
    "streams": [
      { "index": 0, "type": "video", "codec": "h264" },
      { "index": 1, "type": "audio", "codec": "aac" }
    ]
  }
}
```

`formatName` is ffprobe's container short-name list. `durationSeconds`, `sizeBytes`, and `bitRate` are `null` when the container doesn't report them; `width` / `height` / `videoCodec` are `null` for audio-only inputs, and `audioCodec` is `null` for video with no audio track. `streams[]` lists every stream with its `index`, `type` (`video` / `audio` / `subtitle` / …), and `codec` (`null` when unknown).
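
A caller will usually reduce the probe response to a few decisions before picking the next op. The sketch below does that with a hypothetical `summarize_probe` helper. It tolerates the `null` fields described above and counts subtitle streams, on the assumption that `subtitle-extract`'s `index` numbers subtitle streams in order of appearance.

```python
def summarize_probe(response: dict) -> dict:
    """Reduce a probe response to the facts needed to choose a follow-up op."""
    probe = response["probe"]
    has_video, has_audio = probe["hasVideo"], probe["hasAudio"]
    if has_video and has_audio:
        kind = "video+audio"
    elif has_video:
        kind = "video-only"
    elif has_audio:
        kind = "audio-only"
    else:
        kind = "no-av"
    subtitle_count = sum(
        1 for stream in probe["streams"] if stream["type"] == "subtitle"
    )
    return {
        "kind": kind,
        "duration": probe.get("durationSeconds"),  # may be None (JSON null)
        "subtitleCount": subtitle_count,
    }
```

For the sample response above this yields `kind` of `video+audio`, a `duration` of `12.43`, and a `subtitleCount` of `0`.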

## trim — options

Cuts by stream-copy: codecs are preserved, no re-encode, fast. Input duration ≤ 300 s.

| Option | Type | Default | Notes |
|---|---|---|---|
| `start` | number (s), required | — | 0–86400; at or past the end of the input → `422 BAD_RANGE` |
| `end` | number (s) | end of input | must be > `start`; pass at most one of `end` / `duration` |
| `duration` | number (s) | to end of input | 0.01–86400 |
| `format` | enum | input container | `mp4` `mov` `mkv` `webm` `m4a` `mp3` `wav` `ogg` `aac` `flac`; when omitted, keeps the input container if compatible, else `mp4` (video) / `mp3` (audio) |
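
The option rules in the table can be checked before sending. A minimal sketch, with `trim_body` as a hypothetical helper; the `BAD_RANGE` check against the input's real duration still happens server-side.

```python
from typing import Optional


def trim_body(file: dict, start: float, end: Optional[float] = None,
              duration: Optional[float] = None,
              fmt: Optional[str] = None) -> dict:
    """Build a trim request body, enforcing the table's option rules."""
    if not 0 <= start <= 86400:
        raise ValueError("start must be within 0-86400 s")
    if end is not None and duration is not None:
        raise ValueError("pass at most one of end / duration")
    if end is not None and end <= start:
        raise ValueError("end must be greater than start")
    if duration is not None and not 0.01 <= duration <= 86400:
        raise ValueError("duration must be within 0.01-86400 s")
    body = {"file": file, "start": start}
    if end is not None:
        body["end"] = end
    if duration is not None:
        body["duration"] = duration
    if fmt is not None:
        body["format"] = fmt
    return body
```

Omitting both `end` and `duration` cuts from `start` to the end of the input.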

## convert — options

The full re-encode. Video targets come back **H.264 + AAC**, downscaled to ≤ 720 px height (`cputools.media.convert.max_height`, aspect preserved), encoded with the x264 `veryfast` preset (`cputools.media.convert.preset`). Audio targets drop the video track. Input duration ≤ 60 s (`cputools.media.convert.max_seconds`).

| Option | Type | Default | Notes |
|---|---|---|---|
| `format` | enum, required | — | `mp4` `mov` `mkv` (video) · `mp3` `m4a` `aac` `wav` `ogg` `flac` (audio). A video target needs a video stream (`422 NO_VIDEO_STREAM`); an audio target needs an audio stream (`422 NO_AUDIO_STREAM`) |
| `width` | integer (px) | — | 1–7680. Force a scale width; omit `height` to preserve aspect. When omitted, the default ≤ 720 px-height downscale applies |
| `height` | integer (px) | — | 1–7680. Force a scale height; omit `width` to preserve aspect |
| `crop` | object | — | `{ w, h, x, y }` (all integers; `w`/`h` 1–7680, `x`/`y` 0–7680) — select a `w`×`h` region with its top-left at `(x, y)` |
| `rotate` | enum | — | `90` `180` `270` — turn the frame clockwise by that many degrees |

The optional scale/crop/rotate transforms compose with the default downscale. When none of them is given, `convert` applies only the aspect-preserving downscale to the height cap.
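
To predict the frame size of a default `convert`, the downscale can be approximated as below. This is a sketch under assumptions: `default_output_size` is a hypothetical name, sources at or under the cap are assumed to pass through unscaled, and the width is rounded down to an even number because H.264 with 4:2:0 chroma needs even dimensions. The server's exact rounding may differ.

```python
def default_output_size(width: int, height: int, max_height: int = 720):
    """Approximate the output (width, height) of convert's default downscale."""
    if height <= max_height:
        # Assumption: sources at or under the cap are not scaled.
        return width, height
    # Integer math keeps the aspect ratio without float rounding noise.
    new_width = width * max_height // height
    new_width -= new_width % 2  # assumption: round down to an even width
    return new_width, max_height
```

A 1920×1080 source comes out at 1280×720 under these assumptions, and a 640×480 source is left alone.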

## thumbnail — options

Needs a video stream (`422 NO_VIDEO_STREAM`).

| Option | Type | Default | Notes |
|---|---|---|---|
| `timestamp` | number (s) | `0` | 0–86400; past the input's duration → `422 BAD_TIMESTAMP`. Default tunable: `cputools.media.thumbnail.default_timestamp` |
| `format` | enum | `png` | `png` `jpeg` (`jpg` accepted). Default tunable: `cputools.media.thumbnail.format` |

## audio-extract — options

Needs an audio stream (`422 NO_AUDIO_STREAM`). Input duration ≤ 300 s.

| Option | Type | Default | Notes |
|---|---|---|---|
| `format` | enum | `mp3` | `mp3` `m4a` `aac` `wav` `ogg` `flac`. Default tunable: `cputools.media.audio.format` |
| `bitrate` | string | `192k` | e.g. `128k`, `320k` (digits + `k`); ignored for lossless targets (`wav` / `flac`). Default tunable: `cputools.media.audio.bitrate` |

## gif — options

Needs a video stream (`422 NO_VIDEO_STREAM`). Single-pass `palettegen`/`paletteuse` for quality; lanczos scaling; the GIF loops.

| Option | Type | Default | Notes |
|---|---|---|---|
| `start` | number (s) | `0` | at or past the end of the input → `422 BAD_RANGE` |
| `duration` | number (s) | `5` | 0.1–60 in the schema, capped at 15 by `cputools.media.gif.max_seconds` → `422 GIF_TOO_LONG` |
| `width` | integer (px) | `480` | 16–1920 in the schema, capped at 640 by `cputools.media.gif.max_width` → `422 GIF_TOO_WIDE`; height follows the aspect ratio |
| `fps` | integer | `15` | 1–50. Default tunable: `cputools.media.gif.fps` |
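
The gap between the schema range and the operator cap is easy to trip over, so a client may want to check both. A minimal sketch using the default caps; `gif_rejection` is a hypothetical helper, and the order in which the server checks duration and width is an assumption.

```python
from typing import Optional

GIF_MAX_SECONDS = 15  # default of cputools.media.gif.max_seconds
GIF_MAX_WIDTH = 640   # default of cputools.media.gif.max_width


def gif_rejection(duration: float = 5, width: int = 480) -> Optional[str]:
    """Return the cap error a gif request would likely hit, or None."""
    if not 0.1 <= duration <= 60 or not 16 <= width <= 1920:
        # Outside the schema itself; the error code for this case is not
        # documented on this page, so raise instead of guessing one.
        raise ValueError("duration or width outside the schema range")
    if duration > GIF_MAX_SECONDS:
        return "GIF_TOO_LONG"
    if width > GIF_MAX_WIDTH:
        return "GIF_TOO_WIDE"
    return None
```

A 20 s request is valid against the schema but still returns `GIF_TOO_LONG` under the default cap.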

## audio-convert — options

A standalone audio→audio re-encode — change the format, bitrate, or sample rate of an audio file. (Use `audio-extract` instead when you want to pull the audio off a *video*.) Needs an audio stream (`422 NO_AUDIO_STREAM`). Input duration ≤ 300 s.

| Option | Type | Default | Notes |
|---|---|---|---|
| `format` | enum | `mp3` | `mp3` `m4a` `aac` `wav` `ogg` `flac`. Default tunable: `cputools.media.audio.format` |
| `bitrate` | string | `192k` | e.g. `128k`, `320k` (digits + `k`); ignored for lossless targets (`wav` / `flac`). Default tunable: `cputools.media.audio.bitrate` |
| `sampleRate` | integer (Hz) | source | 8000–192000 — resample to this rate; omit to keep the source rate |
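
The table's rules translate into a small body builder. This is a sketch with a hypothetical `audio_convert_body` helper; it omits `bitrate` for lossless targets, since the API ignores it there anyway.

```python
import re
from typing import Optional

AUDIO_FORMATS = {"mp3", "m4a", "aac", "wav", "ogg", "flac"}
LOSSLESS = {"wav", "flac"}
BITRATE_RE = re.compile(r"\d+k")  # digits + "k", e.g. "128k"


def audio_convert_body(file: dict, fmt: str = "mp3", bitrate: str = "192k",
                       sample_rate: Optional[int] = None) -> dict:
    """Build an audio-convert request body from the documented options."""
    if fmt not in AUDIO_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    body = {"file": file, "format": fmt}
    if fmt not in LOSSLESS:
        if not BITRATE_RE.fullmatch(bitrate):
            raise ValueError("bitrate must be digits followed by 'k'")
        body["bitrate"] = bitrate
    if sample_rate is not None:
        if not 8000 <= sample_rate <= 192000:
            raise ValueError("sampleRate must be within 8000-192000 Hz")
        body["sampleRate"] = sample_rate
    return body
```

Leaving `sample_rate` unset keeps the source rate, matching the table's default.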

## loudnorm — options

EBU R128 loudness normalization (ffmpeg `loudnorm`) — bring a clip to a consistent perceived level for podcasts, voiceover, or music. The output is **audio** (the normalized audio stream, re-encoded); any video track is dropped. Needs an audio stream (`422 NO_AUDIO_STREAM`). Input duration ≤ 300 s.

| Option | Type | Default | Notes |
|---|---|---|---|
| `ext` | enum | `mp3` | output audio container: `mp3` `m4a` `aac` `wav` `ogg` `flac`. Default tunable: `cputools.media.audio.format` |
| `i` | number (LUFS) | `-16` | integrated-loudness target, −70 to −5 |
| `tp` | number (dBTP) | `-1.5` | true-peak ceiling, −9 to 0 |
| `lra` | number (LU) | `11` | loudness range, 1 to 50 |

## speed — options

Change playback speed. Pitch is **preserved by default** (audio is time-stretched, not pitch-shifted); set `pitchCorrect: false` for a tape-style speed-and-pitch change.

| Option | Type | Default | Notes |
|---|---|---|---|
| `factor` | number | `1.0` | 0.25–4. `2` = twice as fast (half the duration); `0.5` = half speed |
| `ext` | string | input's | output container; defaults to the input's family (`mp4` for video, `mp3` for audio). Must match `^[a-z0-9]{1,12}$` |
| `pitchCorrect` | boolean | `true` | `true` preserves pitch (atempo); `false` lets pitch shift with the speed |
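
The relation between `factor` and the output length is a simple division. A short sketch, with `output_duration` as a hypothetical helper; the real output may differ by a frame or a few audio samples.

```python
def output_duration(input_seconds: float, factor: float = 1.0) -> float:
    """Expected output length in seconds after a speed change."""
    if not 0.25 <= factor <= 4:
        raise ValueError("factor must be within 0.25-4")
    return input_seconds / factor
```

A 120 s clip at `factor: 2` comes out at 60 s, and at `factor: 0.5` at 240 s.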

## subtitle-extract — options

Pull an embedded subtitle/caption track out of a container to a text sidecar. If the requested subtitle stream doesn't exist, you get a **free** `422 NO_SUBTITLE_STREAM`. Flat-priced.

| Option | Type | Default | Notes |
|---|---|---|---|
| `format` | enum | `srt` | `srt` `vtt` (WebVTT) |
| `index` | integer | `0` | 0–63 — which subtitle stream (the first subtitle track is `0`) |

## concat — options

Join 2–10 clips end-to-end with the ffmpeg **concat demuxer**, a fast stream-copy with no re-encode. The inputs **must share the same codec and container**; a mismatched set returns `422 CONVERT_FAILED`, so re-encode each clip via `convert` to a common format first. Billed per MB of the **summed** input size.

| Option | Type | Default | Notes |
|---|---|---|---|
| `files` | array, required | — | 2–10 input-source objects (see [the second input](#the-second-input)). Capped at `cputools.media.concat.max_inputs` (→ `422 TOO_MANY_INPUTS`); summed size capped at `cputools.media.concat.max_total_mb` (→ `413 INPUT_TOO_LARGE`) |
| `ext` | enum | first clip's | output container: `mp4` `mov` `mkv` `mp3` `m4a` `aac` `wav` `ogg` `flac` |

## overlay — options

Composite an image (a logo or watermark) onto a video. The primary `file` must be a video (`422 NO_VIDEO_STREAM`); the `overlay` is a second input carrying the image (see [the second input](#the-second-input)). The video is re-encoded. Billed per MB of the **video**. Input duration ≤ 300 s.

| Option | Type | Default | Notes |
|---|---|---|---|
| `overlay` | input-source, required | — | the watermark image, `{ inline }` or `{ inputKey }` |
| `position` | enum | `top-left` | `top-left` `top-right` `bottom-left` `bottom-right` `center` |
| `ext` | enum | the video's | output container: `mp4` `mov` `mkv` `mp3` `m4a` `aac` `wav` `ogg` `flac` |

## subtitle-burn — options

Render a subtitle file into the video frames with **libass**. The primary `file` must be a video; the `subtitle` is a second input carrying the SRT/WebVTT file (see [the second input](#the-second-input)). The video is re-encoded. Billed per MB of the **video**. Input duration ≤ 300 s.

> **Hard subs, not soft.** The captions are **burned (rasterized) into the picture**, not added as a separate, toggleable subtitle track. The result has no soft subtitle stream and the captions cannot be turned off. If you want a removable track, ship the SRT/WebVTT alongside the video instead. Rendering uses libass with its default font.

| Option | Type | Default | Notes |
|---|---|---|---|
| `subtitle` | input-source, required | — | the SRT/WebVTT file, `{ inline }` or `{ inputKey }` |
| `ext` | enum | the video's | output container: `mp4` `mov` `mkv` `mp3` `m4a` `aac` `wav` `ogg` `flac` |

## MCP tools

The same surface is callable over MCP at `https://api.relaystation.ai/mcp`: `media_probe`, `media_thumbnail`, `media_subtitle_extract`, `media_trim`, `media_audio_extract`, `media_audio_convert`, `media_loudnorm`, `media_speed`, `media_convert`, `media_concat`, `media_overlay`, `media_subtitle_burn`, and `media_gif`. Same auth, same prices as the HTTP routes. (`media_concat`, `media_overlay`, and `media_subtitle_burn` carry the second input as the `files` array / `overlay` / `subtitle` field, same input-source shape as `file`.)
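
The tool names listed above follow a regular pattern: the op name from the route, hyphens turned into underscores, with a `media_` prefix. A small sketch of that mapping, inferred from the list and not from a published rule; `mcp_tool_name` is a hypothetical helper.

```python
def mcp_tool_name(route: str) -> str:
    """Map an HTTP route such as "/v1/media/subtitle-burn" to its MCP tool name."""
    op = route.rstrip("/").rsplit("/", 1)[-1]
    return "media_" + op.replace("-", "_")
```

For example, `/v1/media/audio-extract` maps to `media_audio_extract`.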

## Next

[Quickstart](/docs/quickstart) · [Authentication](/docs/authentication) · [x402 wire format](/docs/x402) · [Receiving outputs](/docs/receiving-outputs) · [Persistence tiers](/docs/persistence-tiers) · [Document conversion](/docs/doc-convert) · [API reference](/api-reference) · more image + file tools at [cputools.relaystation.ai](https://cputools.relaystation.ai)
