200 OK) always includes a credits field with your remaining balance, but the rest of the response shape varies by platform because each upstream source emits different metadata. This page documents the actual shape returned by each endpoint, plus the universal fields and helpers for building SRT/VTT.
Universal fields
These fields are present on every successful response, regardless of platform:
credits: Your remaining credit balance after this call’s deduction. On cache hits no credit is deducted, but the field still reflects your current balance.
A boolean cache-hit flag: true only when the response was served from the shared transcript cache (in which case no credit was charged). Omitted on fresh fetches.
YouTube
transcript: An array of timed segment objects. Each item has text, startMs, endMs, and startTimeText. There is no separate segments field; these objects ARE the transcript.
The same content as transcript, flattened to a single string. Useful when you don’t need timing.
Human-readable name of the caption track that was returned (e.g. "English").
YouTube’s 11-character video ID.
Every caption track YouTube exposes for this video, which is useful if you want to fetch a different language. Each entry has at minimum baseUrl, name.simpleText, languageCode, and isTranslatable.
YouTube segment fields
text: Spoken text for this segment.
startMs: Start time in milliseconds, encoded as a string. Cast to integer for math.
endMs: End time in milliseconds, encoded as a string.
startTimeText: Pre-formatted human-readable timestamp, e.g. "1:23".
TikTok
The transcript as a single string with newlines separating spoken lines. No per-line timing.
Optional. Present when the upstream exposes a downloadable video. TikTok rarely provides a separate HD URL, so hd is typically null.
Facebook
Joined transcript string, newline-separated.
Optional. Facebook usually provides both sd and hd URLs.
Instagram
Note the plural name. Each entry has id, shortcode, and text. Reels typically return a single entry containing the full transcript text.
id: Instagram’s internal numeric media ID for the post.
shortcode: The shortcode from the URL (the same one you submitted in the URL path, e.g. DW3x2NpigcR).
text: The full transcript content for this entry, as a single string with no per-line timing.
Building SRT/VTT (YouTube only)
Only the YouTube endpoint emits per-line timing, so SRT/VTT generation is a YouTube-specific operation.
Why are timestamps strings?
startMs and endMs on YouTube segments are returned as strings because some upstream platforms emit microsecond-precision values that don’t fit cleanly in a 32-bit integer. Strings preserve the value losslessly across all clients. Parse to integer in your application code.