A successful request (200 OK) always includes a credits field with your remaining balance, but the rest of the response shape varies by platform because each upstream source emits different metadata. This page documents the actual shape returned by each endpoint, plus the universal fields and helpers for building SRT/VTT.

Universal fields

These fields are present on every successful response, regardless of platform:
credits
integer
required
Your remaining credit balance after this call’s deduction. On cache hits no credit is deducted, but the field still reflects your current balance.
cached
boolean
true only when the response was served from the shared transcript cache (in which case no credit was charged). Omitted on fresh fetches.
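Because cached is omitted rather than set to false on fresh fetches, client code should treat a missing field as false. A minimal sketch, assuming the response body has already been parsed from JSON; readBilling is a hypothetical helper name, not part of the API:

```javascript
// Sketch: read the two universal fields from a parsed response body.
// `readBilling` is a hypothetical helper, not something the API provides.
function readBilling(response) {
  return {
    // Remaining balance after this call (no deduction on cache hits).
    credits: response.credits,
    // `cached` is omitted on fresh fetches, so a missing field means false.
    cached: response.cached === true,
  };
}

const fresh = readBilling({ transcript: "Hello", credits: 996 });
const hit = readBilling({ transcript: "Hello", credits: 996, cached: true });
console.log(fresh.cached); // false
console.log(hit.cached);   // true
```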

YouTube

{
  "success": true,
  "type": "video",
  "url": "https://www.youtube.com/watch?v=...",
  "transcript": [
    { "text": "...", "startMs": "320", "endMs": "14580", "startTimeText": "0:00" }
  ],
  "transcript_only_text": "... ...",
  "language": "English",
  "videoId": "...",
  "captionTracks": [/* available tracks */],
  "credits": 997
}
transcript
array
required
An array of timed segment objects. Each item has text, startMs, endMs, and startTimeText. There is no separate segments field — these objects ARE the transcript.
transcript_only_text
string
The same content as transcript flattened to a single string, useful when you don’t need timing.
language
string
Human-readable name of the caption track that was returned (e.g. "English").
videoId
string
YouTube’s 11-character video ID.
captionTracks
array
Every caption track YouTube exposes for this video — useful if you want to fetch a different language. Each entry has at minimum baseUrl, name.simpleText, languageCode, and isTranslatable.
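To pick a different language, you can look a track up by its languageCode. The sketch below relies only on the four entry fields listed above; the helper names and the sample data are illustrative assumptions, not real API output:

```javascript
// Sketch: inspect the `captionTracks` array from a YouTube response.
// Helper names and sample data are hypothetical.
function listCaptionLanguages(captionTracks) {
  return (captionTracks || []).map((track) => ({
    code: track.languageCode,
    name: track.name ? track.name.simpleText : undefined,
    translatable: track.isTranslatable === true,
  }));
}

function findCaptionTrack(captionTracks, languageCode) {
  // Returns the first track with a matching code, or null if none exists.
  return (captionTracks || []).find((t) => t.languageCode === languageCode) || null;
}

// Illustrative data only: the URLs are placeholders.
const sampleTracks = [
  { baseUrl: "https://example.com/en", name: { simpleText: "English" }, languageCode: "en", isTranslatable: true },
  { baseUrl: "https://example.com/es", name: { simpleText: "Spanish" }, languageCode: "es", isTranslatable: true },
];

console.log(listCaptionLanguages(sampleTracks).map((l) => l.code)); // [ 'en', 'es' ]
console.log(findCaptionTrack(sampleTracks, "es").name.simpleText);  // Spanish
console.log(findCaptionTrack(sampleTracks, "fr"));                  // null
```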

YouTube segment fields

text
string
Spoken text for this segment.
startMs
string
Start time in milliseconds, encoded as a string. Cast to integer for math.
endMs
string
End time in milliseconds, encoded as a string.
startTimeText
string
Pre-formatted human-readable timestamp, e.g. "1:23".
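Since startMs and endMs arrive as strings, it can help to convert each segment once before doing arithmetic. A minimal sketch; normalizeSegments and the derived durationMs property are hypothetical additions, not fields the API returns:

```javascript
// Sketch: convert string timings to numbers and derive a duration.
// `normalizeSegments` and `durationMs` are hypothetical, not API fields.
function normalizeSegments(transcript) {
  return transcript.map((segment) => {
    const startMs = Number.parseInt(segment.startMs, 10);
    const endMs = Number.parseInt(segment.endMs, 10);
    return {
      text: segment.text,
      startMs,
      endMs,
      durationMs: endMs - startMs,
      startTimeText: segment.startTimeText,
    };
  });
}

const normalized = normalizeSegments([
  { text: "...", startMs: "320", endMs: "14580", startTimeText: "0:00" },
]);
console.log(normalized[0].durationMs); // 14260
```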

TikTok

{
  "transcript": "First line of dialog\nSecond line of dialog\n...",
  "videoUrls": { "sd": "https://...mp4", "hd": null, "thumbnail": "https://...jpg" },
  "credits": 996
}
transcript
string
required
The transcript as a single string with newlines separating spoken lines. No per-line timing.
videoUrls
object
Optional. Present when the upstream exposes a downloadable video. TikTok rarely provides a separate HD URL, so hd is typically null.

Facebook

{
  "transcript": "First line of dialog\nSecond line of dialog\n...",
  "videoUrls": { "sd": "https://...mp4", "hd": "https://...mp4", "thumbnail": "https://...jpg" },
  "credits": 994
}
transcript
string
required
Joined transcript string, newline-separated.
videoUrls
object
Optional. Facebook usually provides both sd and hd URLs.
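TikTok and Facebook responses share the same shape, so one pair of helpers can serve both. The sketch below splits the newline-separated transcript into lines and chooses a video URL, preferring hd and falling back to sd; both helper names are hypothetical:

```javascript
// Sketch for TikTok and Facebook responses; helper names are hypothetical.
function splitTranscriptLines(transcript) {
  // Lines are newline-separated; drop empty lines and surrounding whitespace.
  return transcript
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

function pickVideoUrl(videoUrls) {
  // `videoUrls` is optional, and `hd` is typically null on TikTok.
  if (!videoUrls) return null;
  return videoUrls.hd || videoUrls.sd || null;
}

const tiktokLike = {
  transcript: "First line of dialog\nSecond line of dialog\n",
  videoUrls: { sd: "https://example.com/sd.mp4", hd: null, thumbnail: "https://example.com/t.jpg" },
  credits: 996,
};

console.log(splitTranscriptLines(tiktokLike.transcript).length); // 2
console.log(pickVideoUrl(tiktokLike.videoUrls)); // https://example.com/sd.mp4
```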

Instagram

{
  "success": true,
  "transcripts": [
    {
      "id": "3870781634534573841",
      "shortcode": "DW3x2NpigcR",
      "text": "Full transcript as a single string..."
    }
  ],
  "credits": 995
}
transcripts
array
required
Note the plural name. Each entry has id, shortcode, and text. Reels typically return a single entry containing the full transcript text.
transcripts[].id
string
Instagram’s internal numeric media ID for the post.
transcripts[].shortcode
string
The shortcode from the URL (the same one you submitted in the URL path, e.g. DW3x2NpigcR).
transcripts[].text
string
The full transcript content for this entry, as a single string — no per-line timing.
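If you only need plain text, the shapes documented above can be told apart by inspecting the response itself. The following is a sketch, not an official client: extractPlainText is a hypothetical helper, and the separators used when joining entries (a blank line for Instagram, a space for the YouTube fallback) are assumptions.

```javascript
// Sketch: get plain transcript text from any of the documented shapes.
// `extractPlainText` is hypothetical; the join separators are assumptions.
function extractPlainText(response) {
  // Instagram: plural `transcripts`, each entry carrying `text`.
  if (Array.isArray(response.transcripts)) {
    return response.transcripts.map((entry) => entry.text).join("\n\n");
  }
  // TikTok and Facebook: `transcript` is already a single string.
  if (typeof response.transcript === "string") {
    return response.transcript;
  }
  // YouTube: `transcript` is an array of timed segments.
  if (Array.isArray(response.transcript)) {
    if (typeof response.transcript_only_text === "string") {
      return response.transcript_only_text;
    }
    return response.transcript.map((segment) => segment.text).join(" ");
  }
  throw new Error("Unrecognized transcript response shape");
}

console.log(extractPlainText({ transcripts: [{ id: "1", shortcode: "abc", text: "Reel text" }], credits: 995 }));
// Reel text
console.log(extractPlainText({ transcript: [{ text: "Hi", startMs: "0", endMs: "900" }], credits: 997 }));
// Hi
```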

Building SRT/VTT (YouTube only)

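Only the YouTube endpoint emits per-line timing, so SRT/VTT generation is a YouTube-specific operation. The helper below builds SRT; VTT uses the same timing data with a WEBVTT header line and a period instead of a comma before the milliseconds: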
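// Convert a YouTube `transcript` array into an SRT document.
function toSrt(transcript) {
  return transcript.map((s, i) => {
    // startMs and endMs arrive as strings, so parse them before formatting.
    const start = msToSrtTime(parseInt(s.startMs, 10));
    const end = msToSrtTime(parseInt(s.endMs, 10));
    // Each cue is a 1-based index line, a timing line, then the text.
    return `${i + 1}\n${start} --> ${end}\n${s.text}\n`;
  }).join("\n"); // the extra newline leaves a blank line between cues
}

// Format milliseconds as an SRT timestamp: HH:MM:SS,mmm.
function msToSrtTime(ms) {
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  const millis = ms % 1000;
  return `${pad(hours)}:${pad(minutes)}:${pad(seconds)},${pad(millis, 3)}`;
}

// Left-pad a number with zeros to `len` digits.
function pad(n, len = 2) {
  return String(n).padStart(len, "0");
}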

// Usage:
// const srt = toSrt(response.transcript);
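A VTT counterpart can follow the same pattern. This is a sketch under a few assumptions: toVtt and msToVttTime are hypothetical names, cue identifiers are omitted (WebVTT treats them as optional), and segment text is written as-is without escaping characters such as & or <.

```javascript
// Sketch: build a WebVTT document from a YouTube `transcript` array.
// `toVtt` and `msToVttTime` are hypothetical helpers, not part of the API.
function toVtt(transcript) {
  const cues = transcript.map((s) => {
    const start = msToVttTime(Number.parseInt(s.startMs, 10));
    const end = msToVttTime(Number.parseInt(s.endMs, 10));
    return `${start} --> ${end}\n${s.text}\n`;
  });
  // A WebVTT file starts with the "WEBVTT" line followed by a blank line.
  return `WEBVTT\n\n${cues.join("\n")}`;
}

// Format milliseconds as HH:MM:SS.mmm (a period, where SRT uses a comma).
function msToVttTime(ms) {
  const zeroPad = (n, len = 2) => String(n).padStart(len, "0");
  const hours = Math.floor(ms / 3600000);
  const minutes = Math.floor((ms % 3600000) / 60000);
  const seconds = Math.floor((ms % 60000) / 1000);
  const millis = ms % 1000;
  return `${zeroPad(hours)}:${zeroPad(minutes)}:${zeroPad(seconds)}.${zeroPad(millis, 3)}`;
}

// Usage:
// const vtt = toVtt(response.transcript);
console.log(msToVttTime(14580)); // 00:00:14.580
```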

Why are timestamps strings?

startMs and endMs on YouTube segments are returned as strings because some upstream platforms emit microsecond-precision values that don’t fit cleanly in a 32-bit integer. Strings preserve the value losslessly across all clients. Parse them to integers in your application code (for example with parseInt(value, 10)) before doing arithmetic.
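If you want parsing to fail loudly on unexpected input rather than yield NaN, a stricter parser is one option. A minimal sketch; parseMs is a hypothetical helper, and rejecting anything other than a string of digits is an assumption about what valid values look like:

```javascript
// Sketch: strict parsing of a string-encoded millisecond value.
// `parseMs` is hypothetical; the digits-only rule is an assumption.
function parseMs(value) {
  if (typeof value !== "string" || !/^\d+$/.test(value)) {
    throw new TypeError(`Expected a string of digits, got ${JSON.stringify(value)}`);
  }
  const parsed = Number(value);
  // JavaScript numbers are exact only up to Number.MAX_SAFE_INTEGER.
  if (!Number.isSafeInteger(parsed)) {
    throw new RangeError(`Value ${value} cannot be represented exactly as a number`);
  }
  return parsed;
}

console.log(parseMs("14580")); // 14580
```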