Nojoin AI Agent Instructions

Start Here

Read ../README.md for product scope and major entry points.
Read DEVELOPMENT.md before running local commands or changing build tooling.
Read ARCHITECTURE.md before changing component boundaries or request flows.
Read DEPLOYMENT.md for Docker, GPU or CPU mode, .env setup, and remote access configuration.
Read CALENDAR.md before touching calendar OAuth or sync behavior.
Read SECURITY.md before changing auth, tokens, encryption, or exposure of sensitive data.

Project Context

Nojoin is a distributed meeting intelligence platform. The system records live meeting audio directly from supported browsers, processes the data on a GPU-enabled Docker backend, and presents insights via a Next.js web interface.

Core Philosophy: Centralized Intelligence (GPU server), Ubiquitous Access (Web), Configurable Privacy (Self-hosted with optional local-only AI).

Architecture & Patterns

Backend (FastAPI + Celery)

Service Boundary: The backend/ directory handles API requests and offloads heavy processing to Celery workers via Redis.
- Rule: API endpoints must NEVER run heavy inference (Whisper, Pyannote, LLMs) synchronously. Heavy tasks must be dispatched to Celery.
Data Access: SQLModel is used for ORM. Models are located in backend/models/.
Dependency Injection: backend.api.deps must be used for DB sessions (SessionDep) and current user (CurrentUser).
Heavy Processing:
- Location: backend/worker/tasks.py.
- Constraint: Heavy libraries (torch, whisper, pyannote) must be imported inside the task function to keep the API lightweight and ensure fast startup times.
- Pipeline: Validation -> VAD (Silero) -> Proxy Creation (Alignment) -> Transcribe (Whisper) -> Diarize (Pyannote) -> Phantom Speaker Filter -> Merge -> Voiceprint Extraction -> Deterministic Speaker Resolution (manual names, merge handling, global matches) -> Rolling Diarization Window Reconciliation (replays live-lane windows to apply speaker boundary corrections) -> Frame-level Segmentation Refinement (re-splits boundary-flagged utterances using per-frame pyannote/segmentation-3.0 probabilities; see backend/processing/segmentation_refinement.py) -> Automatic Meeting Intelligence (one provider call for unresolved speaker suggestions, title, and Markdown notes when AI is configured).
- Manual AI Flows: Automatic AI enhancement is provider-gated and no longer uses separate per-feature toggles. Manual Generate Notes remains notes-only, and manual Retry Speaker Inference remains speaker-only.
- Meeting Edge: Live guidance is a separate worker path from end-of-processing meeting intelligence. It consumes recent live transcript context plus optional user focus text and user notes, expects a strict JSON contract, and may use a provider-specific Meeting Edge live model that falls back to the provider’s main model when unset.
- Transcription Engine: The Transcribe step dispatches to a pluggable engine (backend/processing/engines/), selected by the transcription_backend config key. The normal live and final recording flow uses the same selected engine so live transcription can be reused during final processing. Whisper is the default; Parakeet and Canary (both onnx-asr, sharing the OnnxAsrEngine base) are selectable. Parakeet is much faster on supported NVIDIA systems, but trades off some accuracy and language coverage compared with Whisper. Different-engine transcription belongs to explicit manual reprocessing after Settings are changed.
- Live Transcription Latency: Browser capture uploads short WebM/Opus, Ogg/Opus, or MP4 audio segments that the worker transcodes to WAV for the live lane. The backend live lane sequence-gates those uploads, carries trailing speech across segment boundaries, and force-emits continuous speech after about 8 seconds. The recording page should show the in-flight Meeting Edge/status workspace immediately for active recordings; provisional live transcript text is intentionally not rendered there anymore.
- Live Window State: RecordingAudioWindowManifest.status is a legacy compatibility projection. Use asr_status for live/catch-up ASR coverage and diarization_status, diarization_config_hash, and diarization_window_result_id for rolling or catch-up speaker-window coverage. Do not treat legacy live_processed as diarization-complete.
- Live Speaker Assignment: The live lane uses online voice embeddings to keep stable LIVE_XX speaker labels. Short or embedding-less regions should fall back to the most recent stable live label rather than creating a new speaker per fragment. Embedding extraction uses a centred window trimmed away from segment edges to reduce noise-pickup bias. Manual speaker edits and live text edits are authoritative and must survive final processing.
- Final Live Reuse: Live/final transcript reuse must align by stable utterance identifiers or clear one-to-one time overlap. Never use equal array length or array index position as proof that a live segment maps to a final segment. Preserve ambiguous live evidence as metadata and keep final ASR/diarization output.
- Phantom Speaker Filter: Post-diarization stage (backend/processing/phantom_filter.py) that detects and reassigns segments caused by non-speech sounds (notifications, background noise). Uses heuristic detection (duration/segment count) followed by embedding-based validation. Thresholds are defined as named constants in phantom_filter.py (PHANTOM_MAX_DURATION_S, PHANTOM_MAX_SEGMENTS, PHANTOM_EMBEDDING_FLOOR, PHANTOM_MERGE_THRESHOLD).
- Speaker Identification Constants: All speaker matching thresholds are centralised in backend/processing/embedding.py. Do not hardcode threshold values elsewhere; import and reference the named constants (IDENTIFICATION_THRESHOLD, AUTO_UPDATE_THRESHOLD, MARGIN_OF_VICTORY, DRIFT_GUARD_THRESHOLD, SCAN_MATCH_THRESHOLD, UI_SHOW_MATCH_THRESHOLD, UI_STRONG_MATCH_THRESHOLD).
- PyTorch 2.6+ & Safe Globals: The project uses PyTorch 2.6+, which defaults weights_only=True in torch.load for security.
  - Issue: This blocks loading of custom classes (like pyannote.audio.core.task.Specifications and torch.torch_version.TorchVersion) from model checkpoints.
  - Solution: These classes must be explicitly added to the safe globals list using torch.serialization.add_safe_globals([...]) before loading the model. This is handled at the module level in embedding_core.py and diarize.py.
- Configuration: backend.utils.config_manager is used to handle system and user-specific settings persisted in data/config.json. Do not add parallel ad hoc config storage.

Frontend (Next.js + Zustand)

State Management: Zustand (src/lib/store.ts) is used for global UI state (navigation, selection, filters). Prop drilling should be avoided.
API Layer: All API calls MUST go through src/lib/api.ts.
- Browser authentication uses the Secure HttpOnly session cookie issued by /api/v1/login/session.
- Explicit Bearer tokens from /api/v1/login/access-token are reserved for non-browser API clients.
- Browser recording operations use the authenticated session cookie with trusted-origin and ownership checks.
- /api/v1/recordings/init creates an uploading recording for the current user. Browser segment upload, pause, resume, discard, and finalisation must preserve monotonic 0-based segment sequencing and paused-recording lock behavior.
- force_password_change is enforced server-side. Flagged users may only fetch /api/v1/users/me, update /api/v1/users/me/password, or log out until they rotate their password.
- Never put bearer tokens into URL query strings or other browser-visible locations.
Routing: The App Router (src/app/) is utilized.
Styling: Tailwind CSS is the standard styling framework.
Components: Functional components in src/components/ are preferred.

Browser Capture

Structure: Browser capture modules live under frontend/src/lib/capture/.
Platform Support: Chrome on Windows, Linux, and macOS supports shared-audio live capture. Edge, Brave, Arc, and other Chromium-family browsers support shared-audio capture on Windows and Linux, with macOS treated as best-effort. Chrome on Android and iOS supports microphone-only live capture. Firefox, Safari, and other mobile browsers are not supported for live capture.
Capture Strategy:
- getDisplayMedia captures the user-selected tab, window, or screen and its shared audio track when the browser grants one on desktop.
- getUserMedia captures the local microphone. On mobile Chrome, this is the only live capture source.
- Web Audio mixes shared audio and microphone audio, applies gain, and feeds analyser state for the live waveform. Mobile Chrome records microphone-only audio.
- MediaRecorder creates short WebM/Opus, Ogg/Opus, or MP4 audio segments that upload sequentially to /recordings/{id}/segment.
- The worker transcodes each browser segment to canonical 16 kHz, two-channel WAV before live transcription and final concatenation. Channel 0 is shared/system audio when available and channel 1 is microphone audio; ASR/VAD may consume a mono derivative from those preserved channels, but the browser-live asset itself is not mono.
Lifecycle:
- Refreshing, closing, or navigating away from the Nojoin tab during capture marks the recording PAUSED.
- A paused recording blocks new capture until the user resumes or discards it.
- Switching to another browser tab, window, or application must not pause capture.
- Retired native-helper routes should remain terminal and should not issue credentials or accept uploads.

Critical Workflows

Commands

Start Infrastructure:
- Operator deployment: copy the compose and env templates to local files, then run docker compose up -d
- CPU: docker compose up -d after removing the deploy section from docker-compose.yml
- Local source development: use the host and local-compose workflows described in docs/DEVELOPMENT.md
- Remote Access: Ensure .env is configured with the correct WEB_APP_URL.
Migrations:
- Apply: alembic upgrade head
- Create: alembic revision --autogenerate -m "message"
Backend Tests:
- Ensure the virtual environment is active (e.g., source .venv/bin/activate)
- Run: pytest or pytest backend
Frontend:
- Development: cd frontend && npm install && npm run dev
- Verification: cd frontend && npm run build
- Lint: cd frontend && npm run lint
Browser Capture Verification:
- Unit tests: cd frontend && npm run test -- --run src/lib/capture
- Manual smoke: start Nojoin in a supported desktop Chromium browser, share a meeting tab with audio enabled, verify waveform and Meeting Edge or processing-state updates, pause/resume, stop/finalize, and unsupported-browser messaging where practical. For mobile capture changes, also smoke Chrome on Android or iOS microphone-only recording with the tab open and the phone awake.

Release Workflow (Unified Lock-step)

The project uses a single Git Tag (vX.Y.Z) to trigger the server and frontend release pipeline.

Update Version: Update docs/VERSION to the new version (e.g., 0.6.0).
Commit and Tag:
- Commit the change.
- Create a tag matching the version: git tag v0.6.0
- Push the tag: git push origin v0.6.0
CI/CD Pipeline (.github/workflows/release.yml): Trigger: The push of the v* tag automatically triggers the pipeline.

Step 1: Docker Build: Builds and pushes API, Worker, and Frontend images to GHCR with tags latest and v0.6.0. The API image also embeds the resolved server version for runtime display in Settings.

Step 2: Release Metadata: Publish or update the GitHub Release for the same tag so Settings can surface release notes. Browser capture compatibility belongs in those release notes when capture behavior changes. The current workflow does not create the GitHub Release automatically.

Important:

Versioning: Strict Semantic Versioning (vX.Y.Z).
Source of Truth: The Git Tag is the single source of truth for published releases. Local source builds use docs/VERSION. The API image embeds the resolved server version at build time.
- Version Detection: The API resolves the running version from build metadata embedded into the image (NOJOIN_SERVER_VERSION and /app/.build-version), falling back to bundled or local docs/VERSION in development and test contexts. User-facing release metadata is resolved from GitHub Releases first, with GHCR tags and the GitHub raw docs/VERSION file only used as version fallbacks if release metadata is unavailable.

Code Style & Conventions

Python (Backend)

Type Hints: Mandatory for all function arguments and return values.
Imports: Group standard lib, third-party, and local imports.
Error Handling: Use HTTPException in API endpoints.

TypeScript (Frontend)

Interfaces: Define shared types in src/types/index.ts.
Strict Mode: No any.

Browser Capture (Frontend)

Keep capture lifecycle, recorder, upload, and status behavior covered by focused Vitest tests.
Use browser feature detection for capture support instead of user-agent-only checks wherever possible.
Preserve the unsupported-browser review/admin path when changing capture gating.

Quality Assurance & Build Safety

Frontend Verification:
- Rule: After ANY change to frontend code (frontend/src/**/*), a build check MUST be run to catch type errors that dev mode misses.
- Command: cd frontend && npm run build.
- Why: Next.js dev mode is lenient; production builds are strict.
Type Safety:
- Rule: When adding new data fields (e.g., to Settings or Models), update the TypeScript interfaces in frontend/src/types/index.ts FIRST.
- Rule: Do not use any unless absolutely necessary to bypass library bugs.
Import Verification:
- Rule: When refactoring or moving code, verify that all imports in dependent files are updated. Use grep_search to find usages of moved symbols.
Client-Side Safety:
- Rule: Never assume API_BASE_URL is absolute. Always use safe URL construction (e.g., new URL(path, window.location.origin)) or manually check and prepend origin to handle both relative (production) and absolute (dev) paths.

USAGE.md: End-user workflows and UI behavior.
ADMIN.md: Roles, invitations, password rotation, and admin operations.
BACKUP_RESTORE.md: Backup contents, restore behavior, and sensitivity model.
CAPTURE.md: Browser capture setup, support matrix, and troubleshooting.
PRD.md: Product intent and longer-term scope.
README.md: Documentation index by task.

Working Style

Prefer small, targeted changes that match existing patterns in the touched area.
Link to the relevant docs instead of copying large procedural sections into new files.
If a task touches auth, calendar, processing, or release behavior, read the relevant doc before editing.

Agent Interaction Rules

The Workflow Loop

REQUIREMENT: The user states a feature. You then read the AGENTS.md, DEPLOYMENT.md, PRD.md, and USAGE.md files for context.
PLANNING: Produce a detailed plan. Consider signal propagation and dependencies.
APPROVAL: Wait for user confirmation.
IMPLEMENTATION: Generate robust code. Do not delete existing functionality unless planned.
UI DUPLICATION: When modifying the Context Menu for recordings, remember that there are TWO places to update:
- frontend/src/components/RecordingCard.tsx: The main grid view.
- frontend/src/components/Sidebar.tsx: The sidebar list view.
- Failure to update both will result in inconsistent behavior.
TESTING: The user performs manual testing but the agent should run the relevant host-side build checks and rebuild any locally customised container services when needed to catch build-time errors and ensure changes are reflected in the environment. The agent should also provide detailed instructions for testing the new feature, including any necessary setup steps, expected outcomes, and edge cases to consider.
COMPLETION: Update the docs as needed.

Constraints

NO GIT COMMANDS: Never push/pull automatically. Provide text for messages.
NO EMOJIS: Keep output strict and professional.
TONE: Objective, results-oriented. No fluff.
COMMENTS: Add comments to code where the code is non-obvious. Comments should be brief, professional, and to the point with absolutely zero ‘developer thought’ or ‘developer intent’ or ‘developer reasoning’ style comments.