Overview

Welcome to the Fluxions API. Our hosted endpoints cover three product surfaces:

  • Transcription: akro-v1, our listening model, handles speech-to-text, speaker diarization, and non-speech events (breaths, laughter, hesitations) in one call. Production-ready today.
  • Text-to-Speech: hosted VUI for conversational TTS. Coming soon; join the waitlist.
  • Realtime Voice: an OpenAI Realtime-compatible WebSocket for end-to-end streaming voice conversations. Coming soon.

This page covers the basics that apply across all surfaces: authentication, base URL, and a health check.

Authentication

All API requests require authentication using an API key. Include your API key in the Authorization header:

curl "https://api.fluxions.ai/endpoint" \
-H "Authorization: YOUR_API_KEY"

Important: Do not use the "Bearer " prefix. Include the API key directly in the Authorization header.
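
As an illustration of the header rule above, here is a minimal Python sketch using only the standard library. The helper name `build_request` and the guard against a "Bearer " prefix are assumptions made for this example, not part of the API.

```python
import urllib.request

BASE_URL = "https://api.fluxions.ai"


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Build a GET request that carries the bare API key.

    Hypothetical helper: the key goes into the Authorization header
    as-is, with no "Bearer " prefix.
    """
    if api_key.lower().startswith("bearer "):
        raise ValueError('pass the bare API key, without a "Bearer " prefix')
    return urllib.request.Request(
        BASE_URL + path,
        headers={"Authorization": api_key},
    )


# Building the request does not touch the network.
req = build_request("/transcriptions", "YOUR_API_KEY")
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; the sketch stops before that so it can be run without a key.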

Base URL

https://api.fluxions.ai

GET /health — Health Check

Check the API status and version information. No authentication required.

Request

curl "https://api.fluxions.ai/health"

Response

{
  "status": "ok",
  "version": "1.0.0",
  "model": "akro-v1"
}
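
A client might gate startup on this response. The sketch below parses a response body and raises if the status is not "ok". Treating every other status value as unhealthy is an assumption, since only "ok" is documented, and `parse_health` is a hypothetical helper name.

```python
import json


def parse_health(body: str) -> dict:
    """Parse a /health response body and fail loudly if status is not "ok".

    ASSUMPTION: any status other than "ok" means the API is unhealthy.
    """
    payload = json.loads(body)
    if payload.get("status") != "ok":
        raise RuntimeError(f"API unhealthy: {payload!r}")
    return payload


# The sample body is the documented response shown above.
sample = '{"status": "ok", "version": "1.0.0", "model": "akro-v1"}'
info = parse_health(sample)
print(info["version"], info["model"])
```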

Transcription

Our akro-v1 model is a comprehensive listening model that performs:

  • Transcription — Convert speech to text with high accuracy
  • Speaker Diarization — Identify and separate different speakers ("who said what")
  • Non-Speech Detection — Capture breathing, laughter, hesitation, and other contextual sounds

This makes it ideal for transcribing meetings, interviews, podcasts, and any audio where understanding the full context matters.

All transcription endpoints require authentication — see Overview for API key setup.

POST /submit — Submit Transcription

Submit audio for processing and receive a job ID immediately. Poll /transcriptions/{id} for results including transcription, speaker diarization, and non-speech events.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| non_speech | boolean | false | Include non-speech sounds |
| filename | string | "audio" | Name for the uploaded file |
| cache | boolean | true | Use cached results for identical files |

Request

Body: raw audio file bytes.

curl -X POST "https://api.fluxions.ai/submit" \
-H "Authorization: YOUR_API_KEY" \
-H "Content-Type: audio/mpeg" \
--data-binary @audio.mp3

Response

{
  "id": 124,
  "status": "submitted",
  "created_at": "2025-10-24T10:35:00.000Z",
  "original_audio_url": "https://...",
  "query_urls": {
    "get": "https://api.fluxions.ai/transcriptions/124",
    "status": "https://api.fluxions.ai/transcriptions/124"
  },
  "cached": false
}
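
The same submission can be sketched in Python. Note the assumptions: this section does not say how the parameters are transmitted, so the sketch sends them as query-string values with booleans rendered as "true"/"false". The helper name `build_submit_request` is also hypothetical.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.fluxions.ai"


def build_submit_request(audio: bytes, api_key: str,
                         content_type: str = "audio/mpeg",
                         **params) -> urllib.request.Request:
    """Build a POST /submit request with raw audio bytes as the body.

    ASSUMPTION: parameters (non_speech, filename, cache) travel in the
    query string, and booleans are encoded as "true"/"false".
    """
    query = {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in params.items()
    }
    url = BASE_URL + "/submit"
    if query:
        url += "?" + urllib.parse.urlencode(query)
    return urllib.request.Request(
        url,
        data=audio,
        headers={"Authorization": api_key, "Content-Type": content_type},
        method="POST",
    )


# Placeholder bytes stand in for a real audio file; nothing is sent here.
req = build_submit_request(b"\x00\x01", "YOUR_API_KEY", non_speech=True)
print(req.get_method(), req.full_url)
```

In real use the body would come from something like `open("audio.mp3", "rb").read()`, and the request would be sent with `urllib.request.urlopen(req)`.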

Workflow

  1. Submit audio via /submit and receive a job ID
  2. Poll /transcriptions/{id} to check the job status
  3. When the status is "completed", retrieve the full results
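
The workflow above can be sketched as a small polling loop. The transport is injected as a `fetch_job` callable (hypothetical) that returns the decoded JSON for GET /transcriptions/{id}, which keeps the loop testable without network access. The polling interval, the timeout, and reading `error_message` from the job object are assumptions.

```python
import time


def wait_for_result(fetch_job, job_id, interval=2.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll a job until it is "completed", raising on "failed" or timeout.

    fetch_job(job_id) must return the job as a dict. The interval and
    timeout defaults are illustrative, not documented values.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_job(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            # ASSUMPTION: error_message is a field on the job object.
            raise RuntimeError(f"job {job_id} failed: {job.get('error_message')}")
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout}s")
        sleep(interval)


# Simulated server responses stand in for real HTTP calls.
responses = iter([
    {"id": 124, "status": "submitted"},
    {"id": 124, "status": "processing"},
    {"id": 124, "status": "completed", "text": "SPEAKER_0: Hello."},
])
job = wait_for_result(lambda _id: next(responses), 124, sleep=lambda s: None)
print(job["status"])
```

A fixed interval keeps the sketch short; a production client would probably back off between polls.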

GET /transcriptions/{id} — Get Transcription Results

Retrieve the full results for a specific job: transcription, speaker diarization, and non-speech events.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| word_level_timestamps | boolean | false | Include word-level timestamps in segments |

Request

curl "https://api.fluxions.ai/transcriptions/124" \
-H "Authorization: YOUR_API_KEY"

Response

{
  "id": 124,
  "status": "completed",
  "created_at": "2025-10-24T10:35:00.000Z",
  "updated_at": "2025-10-24T10:35:20.000Z",
  "filename": "interview.mp3",
  "audio_duration": 300.0,
  "audio_format": "opus",
  "processing_time": 245.5,
  "language": "en",
  "non_speech": false,
  "num_chunks": 11,
  "num_segments": 25,
  "num_speakers": 2,
  "text": "SPEAKER_0: Yeah, let's actually start off exactly, where we initially began.\nSPEAKER_1: Sounds perfect. That makes complete sense to me.\nSPEAKER_0: So I started thinking about what if this is just a construct?",
  "segments": [
    {
      "speaker": "0",
      "text": "Yeah, let's actually start off exactly, where we initially began.",
      "start": 0.86,
      "end": 6.42,
      "segment_idx": 0
    },
    {
      "speaker": "1",
      "text": "Sounds perfect",
      "start": 6.0,
      "end": 7.2,
      "segment_idx": 0
    },
    {
      "speaker": "1",
      "text": "That makes complete sense to me.",
      "start": 7.5,
      "end": 9.8,
      "segment_idx": 1
    }
  ],
  "audio_url": "https://...r2.cloudflarestorage.com/...",
  "cached": true
}

Status Values

  • submitted — Job has been submitted
  • processing — Transcription in progress
  • completed — Transcription finished successfully
  • failed — Transcription failed (check error_message)

GET /transcriptions — List Transcriptions

List all transcriptions for your account.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| limit | integer | 50 | Number of results per page (max: 100) |
| offset | integer | 0 | Pagination offset |

Request

curl "https://api.fluxions.ai/transcriptions?limit=10&offset=0" \
-H "Authorization: YOUR_API_KEY"

Response

{
  "total": 150,
  "limit": 10,
  "offset": 0,
  "transcriptions": [
    {
      "id": 150,
      "status": "completed",
      "created_at": "2025-10-24T10:40:00.000Z",
      "filename": "interview.mp3",
      "audio_duration": 1800.0,
      "audio_format": "opus",
      "processing_time": 45.2,
      "num_speakers": 2,
      "num_segments": 142,
      "original_audio_url": "https://...",
      "language": "en"
    }
  ]
}
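
To walk the whole list, a client can advance offset until it reaches total. The sketch below assumes that total is the number of transcriptions across all pages, which the example suggests but does not state. `fetch_page(limit, offset)` is a hypothetical callable that returns the decoded JSON for one page.

```python
def iter_transcriptions(fetch_page, limit=50):
    """Yield every transcription by paging with limit and offset.

    ASSUMPTION: "total" counts all transcriptions on the account,
    not just the ones in the current page.
    """
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page["transcriptions"]
        yield from items
        offset += len(items)
        # Stop on an empty page too, so a miscounted total cannot loop forever.
        if not items or offset >= page["total"]:
            break


# An in-memory stand-in for the endpoint, holding 150 fake records.
records = [{"id": i} for i in range(150, 0, -1)]


def fake_page(limit, offset):
    return {"total": len(records), "limit": limit, "offset": offset,
            "transcriptions": records[offset:offset + limit]}


print(sum(1 for _ in iter_transcriptions(fake_page, limit=100)))
```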

Response Format

Text Field

The text field contains the full transcription with speaker labels and optional non-speech events:

  • Speaker Labels: SPEAKER_0:, SPEAKER_1:, etc. prefix each speaker's utterances
  • Line Breaks: Newlines (\n) separate different speaker turns
  • Non-speech Events: When enabled, events like [breath], [pause] appear inline

Example:

SPEAKER_0: Yeah, let's start [breath] where we began.
SPEAKER_1: Sounds good. That makes sense.
SPEAKER_0: So I was thinking about [pause] what if this is a construct?
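
Because each turn sits on its own line behind a SPEAKER_n: prefix, the text field can be split into (speaker, utterance) pairs with a small parser. This is a sketch: `parse_turns` is a hypothetical helper, and folding an unlabeled line into the previous turn is an assumption.

```python
import re

# Matches "SPEAKER_0: some text" and captures the speaker number and the text.
TURN = re.compile(r"^SPEAKER_(\d+):\s*(.*)$")


def parse_turns(text: str) -> list:
    """Split a text field into (speaker_id, utterance) tuples."""
    turns = []
    for line in text.split("\n"):
        match = TURN.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
        elif turns and line.strip():
            # ASSUMPTION: an unlabeled line continues the previous turn.
            speaker, previous = turns[-1]
            turns[-1] = (speaker, previous + " " + line.strip())
    return turns


example = (
    "SPEAKER_0: Yeah, let's start [breath] where we began.\n"
    "SPEAKER_1: Sounds good. That makes sense.\n"
    "SPEAKER_0: So I was thinking about [pause] what if this is a construct?"
)
for speaker, utterance in parse_turns(example):
    print(speaker, "->", utterance)
```

The speaker IDs come back as strings ("0", "1"), which matches the speaker values in the segments array.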

Segments Array

The segments array provides precise timing and speaker information for each utterance:

  • speaker: Speaker ID as a string ("0", "1", etc.)
  • text: The spoken text for this segment (without non-speech events)
  • start: Start time in seconds (decimal precision)
  • end: End time in seconds (decimal precision)
  • segment_idx: Sequential index for this segment
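
As one example of what the timing fields allow, the sketch below totals speaking time per speaker from start and end. The sample segments reuse the values from the response shown earlier; `talk_time_by_speaker` is a hypothetical helper.

```python
from collections import defaultdict


def talk_time_by_speaker(segments: list) -> dict:
    """Sum (end - start) in seconds for each speaker ID."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


segments = [
    {"speaker": "0", "start": 0.86, "end": 6.42},
    {"speaker": "1", "start": 6.0, "end": 7.2},
    {"speaker": "1", "start": 7.5, "end": 9.8},
]
for speaker, seconds in talk_time_by_speaker(segments).items():
    print(f"SPEAKER_{speaker}: {seconds:.2f}s")
```

Segments from different speakers can overlap in time (the first two do here), so the per-speaker totals may add up to more than the audio duration.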

Non-Speech Events

When non_speech=true, our listening model captures various non-speech sounds and events that provide additional context to the conversation.

Common Non-Speech Sounds

| Event | Tag | Description | Example Usage |
| --- | --- | --- | --- |
| Breath | [breath] | Audible breathing sounds | ...end of sentence. [breath] Now this is important. |
| Laugh | [laugh] or hahaha | Laughter - can be written as text or tagged for longer laughs | Oh wow! hahaha [breath] that's hilarious. |
| Hesitation | [hesitation] or [hesitate] | Unclear thinking noises or mouth sounds while pausing - not specific words | Well [hesitation] um I'm not really sure. |
| Pause | [pause] | Unnaturally long, noticeable pause (e.g., looking something up) | Let me just uh... [pause] Let me look this up. |
| Environment | [env] | Background noise or environmental sounds | I was thinking [env] about what you said. |
| Tut | [tut] | Tongue click or lip smack sound | [tut] That's not quite right. |
| Sigh | [sigh] | Expressive exhale sound | [sigh] I suppose you're right. |
| Sniff | [sniff] | Nasal inhale or sniffing sound | [sniff] Something smells good in here. |
| Cough | [cough] | Coughing sound | Sorry, excuse me [cough] as I was saying... |

Usage Notes

  • Non-speech events are placed inline with the transcribed text
  • Events appear at their natural position in the conversation flow
  • Word elongation is marked with ellipsis: um... so... I think...
  • Emphasis on words uses asterisks: I *really* think so
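
Since the events are inline bracketed tags, downstream code can count them or strip them to recover plain text. The sketch below covers only the tags listed in this section. Whether the model emits other tags is not stated here, so the fixed tag list is an assumption, and both helper names are hypothetical.

```python
import re
from collections import Counter

# Tags documented in this section; other tags would pass through untouched.
TAGS = ("breath", "laugh", "hesitation", "hesitate", "pause",
        "env", "tut", "sigh", "sniff", "cough")
TAG_RE = re.compile(r"\[(" + "|".join(TAGS) + r")\]")


def count_events(text: str) -> Counter:
    """Count occurrences of each known non-speech tag."""
    return Counter(TAG_RE.findall(text))


def strip_events(text: str) -> str:
    """Remove known tags and tidy the spaces they leave, line by line."""
    cleaned = []
    for line in text.split("\n"):
        without_tags = TAG_RE.sub(" ", line)
        cleaned.append(re.sub(r"[ \t]+", " ", without_tags).strip())
    return "\n".join(cleaned)


line = "Sorry, excuse me [cough] as I was saying..."
print(count_events(line))
print(strip_events(line))
```

Laughter written out as text (hahaha), elongation ellipses, and emphasis asterisks are left in place; handling those would need separate rules.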

Speech

Hosted VUI — conversational text-to-speech and OpenAI Realtime-compatible streaming voice.

Coming soon. Read the launch post or join the waitlist.