Response schema

A successful request (200 OK) always includes a credits field with your remaining balance. The rest of the response shape varies by platform, because each upstream source emits different metadata. This page documents the fields shared by all endpoints, the shape each endpoint returns, and helpers for building SRT/VTT from YouTube transcripts.

Universal fields

These fields can appear on any successful response, regardless of platform. credits is always present; cached appears only on cache hits.

credits

integer

required

Your remaining credit balance after this call’s deduction. On cache hits no credit is deducted, but the field still reflects your current balance.

cached

boolean

true only when the response was served from the shared transcript cache (in which case no credit was charged). Omitted on fresh fetches.
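A minimal sketch of how a client might read these two fields together. The helper name describeBilling is hypothetical and not part of the API; it assumes only the behavior described above (cached is omitted unless the response came from the cache).

```javascript
// Summarize the billing-related fields of a successful response.
// `describeBilling` is a hypothetical helper name, not part of the API.
function describeBilling(response) {
  // `cached` is omitted on fresh fetches, so compare strictly against true.
  const fromCache = response.cached === true;
  return {
    fromCache,
    // Per the field descriptions, cache hits are not charged.
    charged: !fromCache,
    remainingCredits: response.credits,
  };
}
```

Comparing against true rather than testing truthiness keeps the check correct whether the field is missing or present.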

YouTube

{
  "success": true,
  "type": "video",
  "url": "https://www.youtube.com/watch?v=...",
  "transcript": [
    { "text": "...", "startMs": "320", "endMs": "14580", "startTimeText": "0:00" }
  ],
  "transcript_only_text": "... ...",
  "language": "English",
  "videoId": "...",
  "captionTracks": [/* available tracks */],
  "credits": 997
}

transcript

array

required

An array of timed segment objects. Each item has text, startMs, endMs, and startTimeText. There is no separate segments field — these objects ARE the transcript.

transcript_only_text

string

The same content as transcript flattened to a single string, useful when you don’t need timing.

language

string

Human-readable name of the caption track that was returned (e.g. "English").

videoId

string

YouTube’s 11-character video ID.

captionTracks

array

Every caption track YouTube exposes for this video — useful if you want to fetch a different language. Each entry has at minimum baseUrl, name.simpleText, languageCode, and isTranslatable.
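As a rough illustration of using captionTracks to look for another language, the sketch below searches the array by languageCode. The function name findCaptionTrack and the sample values (including the example.invalid URLs and the "en"/"es" codes) are assumptions made for illustration, not real API output.

```javascript
// Return the first caption track matching a language code, or null.
// `findCaptionTrack` is a hypothetical helper, not part of the API.
function findCaptionTrack(captionTracks, languageCode) {
  // Tolerate a missing array so callers need not check first.
  const tracks = captionTracks ?? [];
  return tracks.find((track) => track.languageCode === languageCode) ?? null;
}

// Illustrative data only: these values are made up for the example.
const exampleTracks = [
  { baseUrl: "https://example.invalid/en", name: { simpleText: "English" }, languageCode: "en", isTranslatable: true },
  { baseUrl: "https://example.invalid/es", name: { simpleText: "Spanish" }, languageCode: "es", isTranslatable: true },
];

const spanishTrack = findCaptionTrack(exampleTracks, "es");
// spanishTrack.name.simpleText is "Spanish"
```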

YouTube segment fields

text

string

Spoken text for this segment.

startMs

string

Start time in milliseconds, encoded as a string. Cast to integer for math.

endMs

string

End time in milliseconds, encoded as a string.

startTimeText

string

Pre-formatted human-readable timestamp, e.g. "1:23".
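Because startMs and endMs are strings, arithmetic on them needs an explicit conversion first. The sketch below shows one way to do that; parseSegment is a hypothetical helper name, and the sample segment reuses the values from the YouTube example above.

```javascript
// Convert a YouTube segment's string timestamps to numbers.
// `parseSegment` is a hypothetical helper, not part of the API.
function parseSegment(segment) {
  const startMs = Number.parseInt(segment.startMs, 10);
  const endMs = Number.parseInt(segment.endMs, 10);
  return {
    text: segment.text,
    startMs,
    endMs,
    // Only meaningful once both values are numbers.
    durationMs: endMs - startMs,
  };
}

const parsedSegment = parseSegment({
  text: "...",
  startMs: "320",
  endMs: "14580",
  startTimeText: "0:00",
});
// parsedSegment.durationMs is 14260
```

Without the conversion, "14580" - "320" happens to work through coercion, but "14580" + "320" concatenates, so parsing up front avoids that class of bug.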

TikTok

{
  "transcript": "First line of dialog\nSecond line of dialog\n...",
  "videoUrls": { "sd": "https://...mp4", "hd": null, "thumbnail": "https://...jpg" },
  "credits": 996
}

transcript

string

required

The transcript as a single string with newlines separating spoken lines. No per-line timing.

videoUrls

object

Optional. Present when the upstream exposes a downloadable video. TikTok rarely provides a separate HD URL, so hd is typically null.

Facebook

{
  "transcript": "First line of dialog\nSecond line of dialog\n...",
  "videoUrls": { "sd": "https://...mp4", "hd": "https://...mp4", "thumbnail": "https://...jpg" },
  "credits": 994
}

transcript

string

required

Joined transcript string, newline-separated.

videoUrls

object

Optional. Facebook usually provides both sd and hd URLs.
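Since videoUrls is optional on both TikTok and Facebook responses, and hd may be null, a client that wants a single download URL needs a fallback order. The following is one possible sketch; bestVideoUrl is a hypothetical name, and preferring hd over sd is an assumption about what the caller wants.

```javascript
// Pick the highest-quality video URL available, or null if none.
// `bestVideoUrl` is a hypothetical helper, not part of the API.
function bestVideoUrl(response) {
  const urls = response.videoUrls;
  // The whole object is optional.
  if (!urls) return null;
  // Prefer hd, fall back to sd when hd is null or missing.
  return urls.hd ?? urls.sd ?? null;
}
```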

Instagram

{
  "success": true,
  "transcripts": [
    {
      "id": "3870781634534573841",
      "shortcode": "DW3x2NpigcR",
      "text": "Full transcript as a single string..."
    }
  ],
  "credits": 995
}

transcripts

array

required

Note the plural name. Each entry has id, shortcode, and text. Reels typically return a single entry containing the full transcript text.

transcripts[].id

string

Instagram’s internal numeric media ID for the post.

transcripts[].shortcode

string

The post’s shortcode, the same one that appears in the path of the URL you submitted (e.g. DW3x2NpigcR).

transcripts[].text

string

The full transcript content for this entry, as a single string — no per-line timing.
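With all four shapes documented, a client that only needs plain text can branch on the type of the transcript field. The sketch below is one way to do that under the shapes shown on this page; toPlainText is a hypothetical helper name, and joining multiple Instagram entries with a newline is an assumption.

```javascript
// Extract plain transcript text from any platform's response.
// `toPlainText` is a hypothetical helper, not part of the API.
function toPlainText(response) {
  // Instagram: entries live under the plural `transcripts` key.
  if (Array.isArray(response.transcripts)) {
    return response.transcripts.map((entry) => entry.text).join("\n");
  }
  // YouTube: `transcript` is an array of timed segments.
  if (Array.isArray(response.transcript)) {
    // Prefer the pre-flattened string, else join the segment texts.
    return (
      response.transcript_only_text ??
      response.transcript.map((segment) => segment.text).join(" ")
    );
  }
  // TikTok and Facebook: `transcript` is already a string.
  if (typeof response.transcript === "string") {
    return response.transcript;
  }
  return null;
}
```

Checking the Instagram key first avoids mistaking a response without a transcript field for an unsupported shape.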

Building SRT/VTT (YouTube only)

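Only the YouTube endpoint emits per-segment timing, so SRT/VTT generation is a YouTube-specific operation. The helper below builds SRT. VTT uses the same cue structure, but the file starts with a WEBVTT header line and timestamps use a period instead of a comma before the milliseconds.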

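// Convert a YouTube transcript array into SRT text.
function toSrt(transcript) {
  return transcript.map((s, i) => {
    // startMs and endMs arrive as strings, so parse them before formatting.
    const start = msToSrtTime(parseInt(s.startMs, 10));
    const end = msToSrtTime(parseInt(s.endMs, 10));
    // Each SRT cue is: 1-based index, time range, then the text.
    return `${i + 1}\n${start} --> ${end}\n${s.text}\n`;
  }).join("\n");
}

// Format milliseconds as an SRT timestamp: HH:MM:SS,mmm.
function msToSrtTime(ms) {
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  const millis = ms % 1000;
  return `${pad(hours)}:${pad(minutes)}:${pad(seconds)},${pad(millis, 3)}`;
}

// Left-pad a number with zeros to the given width.
function pad(n, len = 2) {
  return String(n).padStart(len, "0");
}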

// Usage:
// const srt = toSrt(response.transcript);
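For WebVTT output, a similar sketch applies. The names toVtt and msToVttTime are hypothetical and not part of the API; the block is written to stand alone, so it carries its own padding helper. Cue identifiers are optional in WebVTT and are omitted here.

```javascript
// Format milliseconds as a WebVTT timestamp: HH:MM:SS.mmm
// (a period before the milliseconds, where SRT uses a comma).
function msToVttTime(ms) {
  const pad = (n, len = 2) => String(n).padStart(len, "0");
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  return `${pad(hours)}:${pad(minutes)}:${pad(seconds)}.${pad(ms % 1000, 3)}`;
}

// Convert a YouTube transcript array into WebVTT text.
function toVtt(transcript) {
  const cues = transcript.map((s) => {
    const start = msToVttTime(Number.parseInt(s.startMs, 10));
    const end = msToVttTime(Number.parseInt(s.endMs, 10));
    return `${start} --> ${end}\n${s.text}\n`;
  });
  // The file must begin with the WEBVTT header followed by a blank line.
  return `WEBVTT\n\n${cues.join("\n")}`;
}

// Usage:
// const vtt = toVtt(response.transcript);
```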

Why are timestamps strings?

startMs and endMs on YouTube segments are returned as strings because some upstream platforms emit microsecond-precision values that don’t fit cleanly in a 32-bit integer. Strings preserve the value losslessly across all clients. Parse them to integers in your application code, for example with parseInt(s.startMs, 10).
