Case Study

AI Orchestration

Multi-provider LLM chat with scoped tool registries, MCP server integration, and an event-driven resume RAG pipeline.

The Problem

Most AI integrations are tightly coupled to a single provider. Swap OpenAI for Claude and you're rewriting the integration layer. Tools are hardcoded in the AI service, so adding a new capability means redeploying the whole thing.

Meanwhile, resume parsing that blocks the upload request creates a poor user experience. The user stares at a spinner while an LLM processes their document — or worse, the request times out.

The Solution

A provider-agnostic AI service built on Microsoft.Extensions.AI. The abstraction layer means switching between OpenAI, Claude, and Gemini is a configuration change, not a code change. Each provider is registered as a keyed IChatClient singleton.

Tools aren't hardcoded — they're discovered at runtime via the Model Context Protocol (MCP). The monolith and microservices each expose their own MCP server. The AI service connects to whichever one matches the user's current session mode, discovering available tools dynamically.

Resume processing is fully async. Upload triggers an event, a background handler downloads and parses the resume, generates embeddings, and stores them in pgvector. The user gets real-time progress updates via SignalR.

AI chat creating a company with sequence diagram showing the full saga flow across 8 services — AI assistant creating a company via function calling — the sequence diagram shows the full request flow across gateway, AI service, MCP server, and connector saga (2115ms end-to-end)

Architecture

The AI service uses a scoped tool registry pattern. Four chat scopes (SystemAdmin, Admin, CompanyAdmin, Public) each get their own tool set. A ChatOptionsFactory resolves the correct registry based on the authenticated user's role and reads the x-mode header to select the right MCP topology.

The resume pipeline is a three-stage event-driven flow: ResumeUploadedV1Event triggers download and parsing, ResumeParsedV1Event triggers embedding generation, and ResumeDeletedV1Event triggers cleanup. Each stage communicates through Dapr pub/sub on RabbitMQ.

MCP Inspector showing dynamically discovered tools from the monolith MCP server — MCP Inspector connected to the monolith server — tools like company_list, draft_job, and finalize_job are discovered at runtime via SSE transport

What You See

In the admin app, open the AI chat and ask it to create a company. Watch the conversation — you'll see the model decide which tool to call, the function invocation, and the result. Switch between monolith and microservices mode and ask the same question: the AI discovers different tools from different MCP servers.

In the public app, upload a resume. A progress indicator updates in real-time as the system downloads, parses, and embeds the document. Then ask the AI for job recommendations — it queries the pgvector embeddings to find semantic matches.

Resume parsing progress showing 3/5 sections extracted with real-time section checklist — Real-time resume processing progress streamed via SignalR — section-by-section extraction with live status updates

AI Provider Settings page showing dropdown with Azure, OpenAI, Gemini, and Claude options — AI provider configuration — switching between Azure, OpenAI, Gemini, and Claude is a dropdown change, not a code change

Behind the Scenes

The FunctionInvokingChatClient from Microsoft.Extensions.AI handles the tool-calling loop automatically. When the model returns a tool call, the middleware invokes the matching AIFunction, feeds the result back, and lets the model decide whether to call another tool or respond to the user.

MCP tool discovery happens through McpToolProvider, which connects to the backend MCP servers via SSE transport. The provider resolves tools at startup and caches them. When the user's session mode changes, a different MCP topology is selected.

JWT tokens are forwarded through AsyncLocal storage so that tool calls from the AI service authenticate against the backend APIs with the original user's identity. This means authorization rules apply consistently — a CompanyAdmin can only create jobs in their own company, even through the AI chat.

Key Decisions

MCP servers over in-process tools

Why: In-process tools couple the AI service to domain logic. MCP servers let each backend expose its own tools independently. Adding a tool to the monolith doesn't require redeploying the AI service.

Alternative: Shared NuGet package with tool definitions. Simpler but creates tight coupling.

Microsoft.Extensions.AI over direct SDK calls

Why: The abstraction lets us swap providers without changing application code. The decorator pipeline applies uniformly regardless of which provider is active.

Alternative: Semantic Kernel. More opinionated, heavier dependency.

pgvector over a dedicated vector database

Why: PostgreSQL was already in the stack. Adding the pgvector extension avoids introducing another database to operate. For portfolio scale, it performs well.

Alternative: Pinecone, Qdrant, or Weaviate. Better at scale but adds operational complexity.

Three chat scopes with policy-based auth

Why: Different user roles need different capabilities. Scoping at the chat level prevents privilege escalation through the AI.

Alternative: Single chat endpoint with runtime permission checks per tool. Simpler routing but harder to audit.

Tradeoffs & Lessons Learned

MCP adds a network hop per tool call: Each tool invocation goes from AI service to MCP server to backend API and back. The latency is acceptable (tens of milliseconds) but visible in traces.
pgvector limits: Cosine similarity search with 1536-dimension embeddings works well up to roughly a million vectors. Beyond that, you'd need approximate nearest neighbor indexes or a dedicated vector database.
Tool descriptions matter enormously: The LLM selects tools based on their descriptions. Vague descriptions cause N+1 tool calls. We saw 20x token savings by making descriptions explicit.
Redis conversation window: Chat history is stored in Redis and truncated at 40 messages. A smarter approach would summarize older messages instead of dropping them.