Nojoin Architecture Overview
This document provides a human-readable overview of how Nojoin fits together.
For product scope and longer-term feature intent, see PRD.md.
System at a Glance
Nojoin has three major parts:
- A Dockerised backend that stores data and runs processing workloads.
- A Next.js web client for browser capture, review, and administration.
- Celery worker services that transcode live browser segments and run the transcription, diarisation, speaker, and AI processing pipeline.
Core Components
Backend
The backend is responsible for:
- API endpoints.
- Authentication and authorisation.
- Recording lifecycle management.
- Background task dispatch.
- Calendar sync orchestration.
- Release metadata and system operations.
The processing-heavy work runs in Celery workers rather than inside API endpoints.
Web Client
The web client is responsible for:
- Dashboard workflows.
- Recordings workspace and transcript review.
- Speaker management.
- Notes, meeting chat, and document upload.
- User, admin, and system settings.
- Browser capture orchestration through
getDisplayMedia,getUserMedia, Web Audio mixing, MediaRecorder segmenting, sequenced upload, live waveform state, pause/resume, and finalize controls. Mobile Chrome uses the same lifecycle with a microphone-onlygetUserMediapath.
The web client is the only live capture surface. Unsupported browsers retain review, playback, admin, and settings capabilities, but cannot start live recording.
Browser Capture Stack
The browser capture stack is responsible for:
- Prompting for shared tab, window, or screen audio.
- Prompting for microphone access.
- Mixing shared audio and microphone audio in the browser on desktop, or recording microphone-only audio on mobile Chrome.
- Recording short WebM/Opus, Ogg/Opus, or MP4 audio slices and uploading them with session-cookie authentication.
- Preserving the browser-live source layout after worker transcode as 16 kHz, two-channel WAV: channel 0 is shared/system audio when available and channel 1 is microphone audio.
- Exposing analyser output to the live waveform UI.
- Moving recordings to
PAUSEDon real tab unload or app navigation away, then requiring resume or discard before another capture starts.
Recording Flow
- The browser authenticates through a Secure HttpOnly session cookie.
- From the Meet Now card, the user clicks Start Meeting in Chrome on Windows, Linux, or macOS, another supported desktop Chromium browser, or Chrome on Android/iOS for microphone-only recording.
/recordings/initcreates anUPLOADINGrecording for the current user. The same browser session is used for segment, pause, resume, discard, and finalize operations.- On desktop, the browser asks for shared tab/window/screen audio and microphone access, mixes those streams, and records short audio slices. On mobile Chrome, the browser asks for microphone access only and records microphone-only slices.
- The browser uploads segments to
/recordings/{id}/segment?sequence=Nwith monotonically increasing 0-based sequence numbers. - The worker transcodes each browser segment to 16 kHz, two-channel WAV and dispatches the live transcription lane. Channel 0 is shared/system audio when available and channel 1 is microphone audio.
- Finalisation concatenates the completed WAV segments, queues backend processing, and triggers proxy generation.
- The web client shows a live capture or processing status workspace while the job runs.
If the user refreshes, closes, or navigates away from the Nojoin tab while recording, the browser stops capture, drops only the in-memory tail, and asks the backend to mark the recording PAUSED. Uploaded segments remain available. On the next app load, Nojoin blocks new capture behind a mandatory resume-or-discard modal.
Switching focus to another tab, window, or application does not pause capture. Only a real Nojoin tab unload or an intentional Nojoin route change away from the active recording invokes the guarded pause path.
Processing Pipeline
The normal backend processing path is:
- Validation.
- VAD and audio preprocessing.
- Proxy creation for web playback.
- Transcription via a pluggable engine (Whisper by default, Parakeet or Canary via onnx-asr selectable).
- Pyannote diarisation.
- Phantom speaker filtering.
- Merge, voiceprint extraction, and deterministic speaker resolution.
- Rolling diarisation window reconciliation: completed rolling windows captured during the live lane are replayed to apply speaker boundary corrections to provisional live utterances.
- Frame-level segmentation refinement: a second boundary-quality pass using
pyannote/segmentation-3.0inspects boundary-flagged and long live-emitted utterances and re-splits them where the dense per-frame speaker activity map identifies a cleaner turn boundary than the rolling diarisation windows resolved. - Automatic meeting intelligence when an AI provider and model are configured.
- Persistence of unresolved speaker suggestions, meeting title, and Markdown meeting notes.
Manual user notes can be captured during recording or processing and are fed into both the automatic meeting-intelligence stage and the manual note-generation flow.
If AI configuration is missing, the recording still completes with transcript, diarisation, and deterministic speaker resolution intact. Automatic AI enhancement is skipped rather than failing the meeting. Manual Generate Notes and Retry Speaker Inference remain available once AI is configured.
Playback, transcript viewing, and export all operate on the full recording timeline without applying persisted trim offsets.
Live Transcription Lane
While a recording is still uploading, a secondary lane produces provisional transcript text so the web client can show progress before the full pipeline runs:
- Each segment upload endpoint dispatches a live transcription task
(
backend/processing/live_transcribe.py). - The task slices completed speech regions, transcribes them with the same
engine selected by
transcription_backendfor final processing, assigns provisional live speaker identities, writes canonical provisional utterances first, and refreshesTranscript.segmentsas a compatibility projection. VAD regions are padded and each region clip is prepended with a short rolling audio context window (live/context.wav) so the engine has acoustic run-up and word edges are not clipped; the engine output is then sliced back to the region. - The web client shows a single in-flight workspace with waveform, Meeting Edge guidance, notes, and processing visibility as soon as the recording is in flight. The page no longer exposes provisional live transcript text, even though the backend live lane still emits it internally for Meeting Edge and later processing reuse.
- Live speaker assignment uses online voice embeddings. Matching regions are
merged into stable
LIVE_XXspeaker labels; short or embedding-less regions reuse the most recent stable label instead of creating new speaker churn. Live speaker names and transcript edits made by the user are treated as authoritative. - After new live segments land, the API/worker layer best-effort dispatches a
separate
refresh_meeting_edge_task. That task builds a bounded recent transcript window, reuses the previous Meeting Edge summary as rolling context, folds in user-authored notes, optional user focus text, and linked calendar context, then requests a strict JSON response from the configured LLM provider. - Meeting Edge uses the same configured provider as the rest of Nojoin AI, but resolves a separate provider-specific live model when one is set. If no Meeting Edge model is configured for that provider, the worker falls back to the provider’s main model instead of failing the live guidance path.
Segments are numbered sequentially starting at 0 but uploaded concurrently, so the lane uses
a sequence-gated buffer. Each task reads next_expected from a per-recording
live/state.json; a task whose segment is ahead of next_expected returns
immediately (its WAV waits on disk), and only the task holding next_expected
drains the contiguous run of segments present on disk. Audio from the trailing,
not-yet-complete utterance is carried over in live/buffer.wav and joined
to the next run, so an utterance split across a segment boundary is normally
transcribed once as a whole. If speech continues past the live forced-emission
window, the lane emits the current speech region and starts a new live segment.
Browser-live audio window manifests track two independent processing lanes. The
ASR lane records whether live or catch-up ASR consumed the window audio. The
diarisation lane records rolling or catch-up speaker-window work for the active
diarisation configuration and completed window result. The legacy window
status field remains a compatibility projection; new logic should inspect the
lane-specific ASR and diarisation fields. Operator-facing recording pages now
surface only high-level recording progress plus Meeting Edge guidance while a
recording is still in flight.
The live lane is best-effort: any failure is logged, the lane still advances,
and nothing is re-raised. When the recording finalises, process_recording_task
promotes canonical live and catch-up transcript state first, fills only missing
durable spans, replays completed rolling diarisation windows when that is
sufficient, preserves authoritative user edits, and only falls back to a
whole-recording ASR or diarisation rerun when coverage is missing,
confidence remains too low, or the user explicitly requests reprocessing with a
different engine. A different transcription engine is reserved for explicit
manual reprocessing after the user changes the transcription engine in Settings.
Final processing may reuse live transcript text and source-channel speaker authority only after a stable utterance id match or a clear one-to-one time overlap match. It must not align live and final segments by array index. When a merged, split, or low-confidence span is ambiguous, final processing keeps the final ASR/diarisation output and records live evidence in alignment metadata instead of silently applying it to the wrong time span. Manual text and speaker locks remain authoritative.
Startup Canonical Cutover
The unified pipeline now assumes a container-level startup cutover for older meetings rather than a frontend-driven migration workflow.
backend/entrypoint.shruns Alembic throughbackend.startup_migrations.- The same entrypoint then runs
backend.startup_canonical_cutoverbefore the API process starts. - That cutover acquires a database advisory lock, sweeps any recordings whose
pipeline_generationmarker is still unset, and classifies each one into a backend-only compatibility state. - Successfully canonicalized historical meetings are marked
legacy_backfilledand remain viewable through the compatibility projection. - Historical meetings that were still in flight during upgrade or that cannot
be canonicalized safely are marked
legacy_reprocess_requiredand normalized for explicit reprocess instead of continuing to rely on legacy mutation paths. - Only meetings created or explicitly rebuilt through the unified pipeline are
marked
unifiedand treated as fully supported for transcript and speaker mutation flows.
Calendar Flow
- An admin configures Google and/or Microsoft OAuth credentials for the installation.
- End users connect their own accounts from the Personal settings area.
- Nojoin syncs selected calendars into stored dashboard-facing event data, including each event’s description and attendee list.
- The dashboard renders month markers, agenda items, next-event summaries, and colour-coded sources, combining synced calendar events with unlinked Nojoin recordings.
- Recordings carry a nullable
calendar_event_id; a recording is auto-linked to a confidently overlapping calendar event during processing (or linked manually), and the linked event enriches notes and speaker prompts while suppressing the recording’s standalone dashboard calendar item.
Authentication Model
Nojoin uses different auth shapes for different clients:
- Browser traffic: Secure HttpOnly session cookies. State-changing browser requests authenticated by that session must originate from the trusted Nojoin web origin, using standard
OriginorReferervalidation rather than relying only onSameSiteand CORS. - Non-browser API clients: Explicit bearer tokens.
- Browser recording operations: Session-authenticated init, segment, pause, resume, discard, and finalize calls owned by the current user.
- Legacy native-helper routes: Retired routes return structured
410 Goneresponses that point operators to CAPTURE.md.
Forced password rotation is enforced server-side. Flagged users can only reach their self-profile, password update flow, and logout until the rotation is complete.
Storage and Persistence
- PostgreSQL stores metadata, transcripts, speakers, tasks, calendar state, and user settings.
- Redis supports Celery and related queue or cache operations.
- Recordings storage holds source audio, derived proxy assets, and related files on disk.
- Config files store system-wide configuration, while sensitive material is encrypted or otherwise handled separately where appropriate.
Release Model
Nojoin follows a unified release model:
- Git tags in the form
vX.Y.Zdrive published releases. - Docker images are published to GHCR.
- The application surfaces release metadata primarily from GitHub Releases.